# Use custom symbols
A SentencePiece model supports two types of special symbols.
## Control symbol
Control symbols are used to encode special indicators that let the decoder change its behavior dynamically.
Examples include the language indicators in multilingual models. `<s>` and `</s>` are reserved control symbols.
Control symbols must be inserted outside of the SentencePiece segmentation. Developers are responsible for inserting these symbols during data generation and decoding.
It is guaranteed that control symbols have no corresponding surface strings in the original user input. Control symbols are decoded into empty strings.
## User defined symbol
A user defined symbol is handled as one piece in any context. If the symbol appears in the input text, it is always extracted as one piece.
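As a rough illustration of "always extracted as one piece", the following Python sketch pre-splits text around a set of user defined symbols, so that any later segmentation step could never cut through them. This is only a sketch of the observable behavior, not how SentencePiece implements it internally; the function name `split_on_user_symbols` and the symbols `<sep>` and `<cls>` are hypothetical.

```python
import re

def split_on_user_symbols(text, user_symbols):
    """Split text so that every user defined symbol becomes its own chunk."""
    # Try longer symbols first so that overlapping symbols match greedily.
    ordered = sorted(user_symbols, key=len, reverse=True)
    pattern = "(" + "|".join(re.escape(s) for s in ordered) + ")"
    # re.split keeps the captured symbols; drop the empty strings it may emit.
    return [chunk for chunk in re.split(pattern, text) if chunk]

# "<sep>" and "<cls>" are made-up symbols used only for this example.
chunks = split_on_user_symbols("hello<sep>world<cls>", ["<sep>", "<cls>"])
print(chunks)  # ['hello', '<sep>', 'world', '<cls>']
```

The remaining chunks (`hello`, `world`) would still be segmented normally; only the symbols themselves are kept atomic.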
## Specify special symbols at training time
Use the `--control_symbols` and `--user_defined_symbols` flags as follows:
```
% spm_train --control_symbols=<foo>,<bar> --user_defined_symbols=<user1>,<user2> --input=<input file> --model_prefix=<model file> --vocab_size=8000
```
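When the command is assembled from a script, the comma-separated symbol lists are easy to get wrong, and symbols containing `<` or `>` need shell quoting. The sketch below builds the flag list in Python. The helper `build_spm_train_args`, the file name `corpus.txt`, and the symbols `<2ja>`, `<2en>`, and `<sep>` are hypothetical; only the flag names come from the command above.

```python
import shlex

def build_spm_train_args(input_path, model_prefix, vocab_size,
                         control_symbols=(), user_defined_symbols=()):
    """Return spm_train flags as a list of '--name=value' strings."""
    flags = {
        "input": input_path,
        "model_prefix": model_prefix,
        "vocab_size": str(vocab_size),
    }
    # Both symbol flags take a comma-separated list.
    if control_symbols:
        flags["control_symbols"] = ",".join(control_symbols)
    if user_defined_symbols:
        flags["user_defined_symbols"] = ",".join(user_defined_symbols)
    return ["--{}={}".format(name, value) for name, value in flags.items()]

args = build_spm_train_args(
    "corpus.txt", "m", 8000,
    control_symbols=["<2ja>", "<2en>"],   # hypothetical language indicators
    user_defined_symbols=["<sep>"],       # hypothetical separator symbol
)
# shlex.quote protects '<' and '>' from being read as shell redirections.
print(" ".join(shlex.quote(a) for a in ["spm_train"] + args))
```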
# SentencePieceProcessor C++ API
## Load SentencePiece model
To start working with a SentencePiece model, include the `sentencepiece_processor.h` header file.
Then instantiate the `sentencepiece::SentencePieceProcessor` class and call its `Load` method to load the model from a file path or a `std::istream`.
```C++
#include <sentencepiece_processor.h>
sentencepiece::SentencePieceProcessor processor;
const auto status = processor.Load("//path/to/model.model");
if (!status.ok()) {
std::cerr << status.ToString() << std::endl;
// error
}
// You can also load a model from std::ifstream.
// std::ifstream in("//path/to/model.model");
// auto status = processor.Load(in);
```
## Tokenize text (preprocessing)
Call the `SentencePieceProcessor::Encode` method to tokenize text.
```C++
std::vector<std::string> pieces;
processor.Encode("This is a test.", &pieces);
for (const std::string &token : pieces) {
std::cout << token << std::endl;
}
```
To obtain the sequence of vocabulary ids instead, pass a vector of integers:
```C++
std::vector<int> ids;
processor.Encode("This is a test.", &ids);
for (const int id : ids) {
std::cout << id << std::endl;
}
```
## Detokenize text (postprocessing)
Call the `SentencePieceProcessor::Decode` method to detokenize a sequence of pieces or ids into text. Detokenization is essentially the inverse operation of `Encode`, i.e., `Decode(Encode(Normalize(input))) == Normalize(input)`.
```C++
std::vector<std::string> pieces = { "▁This", "▁is", "▁a", "▁", "te", "st", "." }; // sequence of pieces
std::string text;
processor.Decode(pieces, &text);
std::cout << text << std::endl;
std::vector<int> ids = { 451, 26, 20, 3, 158, 128, 12 }; // sequence of ids
processor.Decode(ids, &text);
std::cout << text << std::endl;
```
## Sampling (subword regularization)
Call the `SentencePieceProcessor::SampleEncode` method to sample one segmentation.
```C++
std::vector<std::string> pieces;
processor.SampleEncode("This is a test.", &pieces, -1, 0.2);
std::vector<int> ids;
processor.SampleEncode("This is a test.", &ids, -1, 0.2);
```
`SampleEncode` has two sampling parameters, `nbest_size` and `alpha`, which correspond to `l` and `alpha` in the [original paper](https://arxiv.org/abs/1804.10959). When `nbest_size` is -1, one segmentation is sampled from all hypotheses with the forward-filtering and backward-sampling algorithm.
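To give a feel for what `alpha` does, the Python sketch below samples from a small n-best list with probability proportional to `P(x)^alpha`, which is the sampling distribution described in the linked paper. It is a toy model, not the library's code: the candidate segmentations and their log-probabilities are made up, the helper names are hypothetical, and the `nbest_size=-1` case (sampling over all hypotheses) is not covered.

```python
import math
import random

def sampling_distribution(log_probs, alpha):
    """Return P(x_i)^alpha / sum_j P(x_j)^alpha for candidate log-probabilities."""
    scaled = [alpha * lp for lp in log_probs]
    m = max(scaled)  # subtract the max before exponentiating, for numerical stability
    weights = [math.exp(s - m) for s in scaled]
    total = sum(weights)
    return [w / total for w in weights]

def sample_segmentation(candidates, log_probs, alpha, rng=random):
    """Draw one candidate segmentation according to the smoothed distribution."""
    probs = sampling_distribution(log_probs, alpha)
    return rng.choices(candidates, weights=probs, k=1)[0]

# Made-up 3-best list for "New York" with made-up log-probabilities.
candidates = [["▁New", "▁York"], ["▁New", "▁Y", "ork"], ["▁N", "ew", "▁York"]]
log_probs = [-2.0, -5.0, -6.0]

flat = sampling_distribution(log_probs, alpha=0.2)   # closer to uniform
sharp = sampling_distribution(log_probs, alpha=1.0)  # follows the model scores
print(flat, sharp)
print(sample_segmentation(candidates, log_probs, alpha=0.2, rng=random.Random(0)))
```

A smaller `alpha` flattens the distribution, so less likely segmentations are drawn more often; a larger `alpha` concentrates the samples on the best segmentation.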
## Training
Call the `SentencePieceTrainer::Train` function to train a SentencePiece model. You can pass the same parameters as [spm_train](https://github.com/google/sentencepiece#train-sentencepiece-model) as a single string.
```C++
#include <sentencepiece_trainer.h>
sentencepiece::SentencePieceTrainer::Train("--input=test/botchan.txt --model_prefix=m --vocab_size=1000");
```
## SentencePieceText proto
Use the `SentencePieceText` class to obtain the pieces and ids at the same time. This proto also encodes the UTF-8 byte offsets of each piece over the user input or the detokenized text.
```C++
#include <sentencepiece.pb.h>
sentencepiece::SentencePieceText spt;
// Encode
processor.Encode("This is a test.", &spt);
std::cout << spt.text() << std::endl; // This is the same as the input.
for (const auto &piece : spt.pieces()) {
std::cout << piece.begin() << std::endl; // beginning of byte offset
std::cout << piece.end() << std::endl; // end of byte offset
std::cout << piece.piece() << std::endl; // internal representation.
std::cout << piece.surface() << std::endl; // external representation. spt.text().substr(begin, end - begin) == surface().
std::cout << piece.id() << std::endl; // vocab id
}
// Decode
processor.Decode({10, 20, 30}, &spt);
std::cout << spt.text() << std::endl; // This is the same as the decoded string.
for (const auto &piece : spt.pieces()) {
// the same as above.
}
```
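The offsets are counted in UTF-8 bytes, not characters, which matters for non-ASCII text. The Python sketch below recomputes such offsets from a list of surface strings and checks the `substr(begin, end - begin) == surface` relation mentioned in the comment above. It assumes the surfaces are contiguous and cover the whole text; the helper `byte_offsets` is hypothetical and does not call SentencePiece.

```python
def byte_offsets(surfaces):
    """Return (begin, end) UTF-8 byte offsets for consecutive surface strings."""
    offsets, pos = [], 0
    for surface in surfaces:
        size = len(surface.encode("utf-8"))
        offsets.append((pos, pos + size))
        pos += size
    return offsets

surfaces = ["こんにちは", "世界", "。"]  # each of these characters takes 3 bytes in UTF-8
raw = "".join(surfaces).encode("utf-8")
offsets = byte_offsets(surfaces)
for surface, (begin, end) in zip(surfaces, offsets):
    # Slicing the raw bytes with the offsets recovers the surface string.
    assert raw[begin:end].decode("utf-8") == surface
    print(begin, end, surface)
```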
## Vocabulary management
Use the following methods to inspect the vocabulary and to convert between pieces and ids.
```C++
processor.GetPieceSize(); // returns the size of vocabs.
processor.PieceToId("foo"); // returns the vocab id of "foo"
processor.IdToPiece(10); // returns the string representation of id 10.
processor.IsUnknown(0); // returns true if the given id is an unknown token. e.g., <unk>
processor.IsControl(10); // returns true if the given id is a control token. e.g., <s>, </s>
```
## Extra Options
Use the `SetEncodeExtraOptions` and `SetDecodeExtraOptions` methods to set extra options for encoding and decoding respectively. These methods need to be called right after `Load`.
```C++
processor.SetEncodeExtraOptions("bos:eos"); // add <s> and </s>.
processor.SetEncodeExtraOptions("reverse:bos:eos"); // reverse the input and then add <s> and </s>.
processor.SetDecodeExtraOptions("reverse"); // the decoder's output is reversed.
```
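The effect of the colon-separated option string can be mimicked on a plain list of pieces. The Python sketch below assumes the options are applied from left to right, which matches the "reverse the input and then add" comment above. The function `apply_encode_extra_options` is hypothetical and only illustrates the documented behavior.

```python
def apply_encode_extra_options(pieces, options, bos="<s>", eos="</s>"):
    """Apply 'reverse', 'bos' and 'eos' to a list of pieces, left to right."""
    out = list(pieces)
    for option in options.split(":"):
        if option == "reverse":
            out.reverse()
        elif option == "bos":
            out.insert(0, bos)   # prepend the beginning-of-sentence symbol
        elif option == "eos":
            out.append(eos)      # append the end-of-sentence symbol
        elif option:
            raise ValueError("unknown option: " + option)
    return out

pieces = ["▁This", "▁is", "▁a", "▁test"]
print(apply_encode_extra_options(pieces, "bos:eos"))
print(apply_encode_extra_options(pieces, "reverse:bos:eos"))
```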
# Use custom normalization rule
By default, SentencePiece normalizes the input sentence with a variant of Unicode
[NFKC](https://en.wikipedia.org/wiki/Unicode_equivalence).
SentencePiece also allows us to define a custom normalization rule, which is stored in the model file.
## Use pre-defined normalization rule
SentencePiece provides the following pre-defined normalization rules. It is recommended to use one of them unless you have a special reason not to.
* **nmt_nfkc**: [NFKC](https://en.wikipedia.org/wiki/Unicode_equivalence) normalization with some additional normalization around spaces. (default)
* **nfkc**: original NFKC normalization.
* **nmt_nfkc_cf**: nmt_nfkc + [Unicode case folding](https://www.w3.org/International/wiki/Case_folding) (mostly lower casing)
* **nfkc_cf**: nfkc + [Unicode case folding](https://www.w3.org/International/wiki/Case_folding).
* **identity**: no normalization
You can choose the normalization rule with the `--normalization_rule_name` flag.
```
% spm_train --normalization_rule_name=identity --input=<input file> --model_prefix=<model file> --vocab_size=8000
```
NOTE: Due to a limitation of the normalization algorithm, full NFKC normalization is not implemented. [builder.h] describes example character sequences that are not normalized by our NFKC implementation.
The difference between **nmt_nfkc** and **nfkc** can be found with the `diff -u data/nfkc.tsv data/nmt_nfkc.tsv` command.
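For intuition about what the **nfkc** and **nfkc_cf** rules do, the sketch below uses Python's standard `unicodedata` module. Note the assumption: Python applies full NFKC and `str.casefold()`, whereas SentencePiece uses its own variant (see the NOTE above), so the results can differ on some character sequences.

```python
import unicodedata

def nfkc(text):
    """Full Unicode NFKC normalization from the Python standard library."""
    return unicodedata.normalize("NFKC", text)

def nfkc_cf(text):
    """NFKC followed by case folding, an approximation of the nfkc_cf rule."""
    return nfkc(text).casefold()

print(nfkc("ＡＢＣ"))     # full-width Latin letters become ASCII letters
print(nfkc("①"))         # the circled digit becomes a plain digit
print(nfkc("ﬁ"))         # the ligature is decomposed into two letters
print(nfkc_cf("Ｈello"))  # normalization plus (mostly) lower casing
```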
## Use custom normalization rule
Normalization is performed with user-defined string-to-string mappings and leftmost-longest matching.
You can use a custom normalization rule by preparing a TSV file formatted as follows:
```
41 302 300 1EA6
41 302 301 1EA4
41 302 303 1EAA
...
```
In this sample, the UCS4 sequence [41 302 300] (hex) is converted into [1EA6] (hex). When there are ambiguities in the conversions, the longest rule is used.
Note that a tab is used as the delimiter between the source and target sequences, and a space is used as the delimiter between UCS4 characters. We can make the target sequence empty to remove specific characters from the text.
See [data/nfkc.tsv](data/nfkc.tsv) as an example. Once a TSV file is prepared, you can specify it with the `--normalization_rule_tsv` flag.
```
% spm_train --normalization_rule_tsv=<rule tsv file> --input=<input file> --model_prefix=<model file> --vocab_size=8000
```
`<model file>` embeds the normalization rule, so the same normalization rule is applied whenever `<model file>` is used.
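The rule format and the leftmost-longest matching described in this section can be sketched in a few lines of Python. This is a simplified model for checking a rule file by hand, not SentencePiece's implementation. The helper names are hypothetical, and the rules `41 302` → `C2` and `200B` → (empty) are added here only to show the longest-match and deletion cases.

```python
def decode_field(field):
    """Turn space-separated hex code points into a Python string."""
    return "".join(chr(int(cp, 16)) for cp in field.split())

def parse_rules(tsv_text):
    """Parse 'source<TAB>target' lines into a {source: target} dict."""
    rules = {}
    for line in tsv_text.splitlines():
        if not line.strip():
            continue
        src, _, tgt = line.partition("\t")
        rules[decode_field(src)] = decode_field(tgt)  # empty target deletes the source
    return rules

def normalize(text, rules):
    """Scan left to right; at each position apply the longest matching rule."""
    max_len = max((len(s) for s in rules), default=0)
    out, i = [], 0
    while i < len(text):
        for n in range(min(max_len, len(text) - i), 0, -1):
            if text[i:i + n] in rules:
                out.append(rules[text[i:i + n]])
                i += n
                break
        else:  # no rule matched at this position: copy the character
            out.append(text[i])
            i += 1
    return "".join(out)

rules = parse_rules("41 302 300\t1EA6\n41 302 301\t1EA4\n41 302\tC2\n200B\t\n")
print(normalize("A\u0302\u0300", rules))  # the 3-character rule wins over "41 302"
print(normalize("A\u0302b", rules))       # only the 2-character rule applies
print(normalize("a\u200bb", rules))       # the zero-width space is removed
```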
## Command line tool to perform normalization
```
% spm_normalize --model=<model file> file1 file2..
% spm_normalize --normalization_rule_tsv=custom.tsv file1 file2..
```
The first command line uses the normalization rule embedded in the model file. The second command line uses the normalization rule in the TSV file and is useful for developing a normalization rule interactively.
# SentencePiece Experiments
## Experiments 1 (subword vs word-based model)
### Experimental settings
* Segmentation algorithms:
  * **SentencePiece**: SentencePiece with a language-model based segmentation. (`--model_type=unigram`)
  * **SentencePiece(BPE)**: SentencePiece with Byte Pair Encoding. [[Sennrich et al.](http://www.aclweb.org/anthology/P16-1162)] (`--model_type=bpe`)
  * **Moses**: [Moses tokenizer](https://github.com/moses-smt/mosesdecoder/blob/master/scripts/tokenizer/tokenizer.perl) for English.
  * **KyTea**: [KyTea](http://www.phontron.com/kytea/) for Japanese.
  * **MeCab**: [MeCab](http://taku910.github.io/mecab/) for Japanese.
  * **neologd**: [MeCab with neologd](https://github.com/neologd/mecab-ipadic-neologd) for Japanese.
  * **(Moses/KyTea)+SentencePiece**: Apply SentencePiece (Unigram) to pre-tokenized sentences. We have several variants with different tokenizers, e.g., **(Moses/MeCab)+SentencePiece**, **(MeCab/Moses)+SentencePiece**.
  * **char**: Segments sentences by characters.
* Data sets:
  * [KFTT](http://www.phontron.com/kftt/index.html)
* NMT parameters: ([Google’s Neural Machine Translation System](https://arxiv.org/pdf/1609.08144.pdf) is applied for all experiments.)
  * Dropout prob: 0.2
  * num nodes: 512
  * num lstms: 6
  * Decoder parameters (α and β) are optimized with development data.
* Evaluation metrics:
  * Case-sensitive BLEU on detokenized text with NIST scorer and KyTea segmenter. Used in-house rule-based detokenizer for Moses/KyTea/MeCab/neologd.
### Results (BLEU scores)
#### English to Japanese
|Setting|vocab size|BLEU(dev)|BLEU(test)|src #tokens/sent.|trg #tokens/sent.|
|:---|---:|---:|---:|---:|---:|
|SentencePiece|4k (shared)|0.2857|0.2940|43.7478|29.6998|
|SentencePiece|8k (shared)|0.2785|0.2955|30.9734|25.0540|
|SentencePiece|16k (shared)|0.2664|0.2862|27.1827|21.5326|
|SentencePiece|32k (shared)|0.2641|0.2849|25.0592|19.0840|
|SentencePiece(BPE)|8k (shared)|0.2767|0.2947|31.7693|25.4331|
|(Moses/KyTea)+SentencePiece|8k (shared)|0.2900|0.2985|31.2719|29.9854|
|(Moses/MeCab)+SentencePiece|8k (shared)|0.2817|0.2950|31.4743|28.9537|
|(Moses/neologd)+SentencePiece|8k (shared)|0.2824|**0.3062**|31.2985|28.8645|
|Moses/KyTea|80k/80k|0.2576|0.2824|21.2513|23.2161|
|Moses/MeCab|80k/80k|0.2455|0.2780|21.2513|21.2033|
|Moses/neologd|80k/80k|0.2157|0.2378|21.2513|18.4768|
|Moses/SentencePiece|80k/8k|0.2475|0.2742|21.2513|22.9383|
|SentencePiece/KyTea|8k/80k|0.2778|0.2918|27.0429|23.2161|
|SentencePiece/MeCab|8k/80k|0.2673|0.2919|27.0429|21.2033|
|SentencePiece/neologd|8k/80k|0.2280|0.2494|27.0429|18.4768|
|Char|3k (shared)|0.2509|0.2679|109.8662|33.6963|
#### Japanese to English
|Setting|vocab size|BLEU(dev)|BLEU(test)|src #tokens/sent.|trg #tokens/sent.|
|:---|---:|---:|---:|---:|---:|
|SentencePiece|4k (shared)|0.1970|**0.2179**|29.6998|43.7478|
|SentencePiece|8k (shared)|0.1966|0.2162|25.0540|30.9734|
|SentencePiece|16k (shared)|0.1996|0.2160|21.5326|27.1827|
|SentencePiece|32k (shared)|0.1949|0.2159|19.0840|25.0592|
|SentencePiece(BPE)|8k (shared)|0.1977|0.2173|25.4331|31.7693|
|(KyTea/Moses)+SentencePiece|8k (shared)|0.1921|0.2086|29.9854|31.2719|
|(MeCab/Moses)+SentencePiece|8k (shared)|0.1909|0.2049|28.9537|31.4743|
|(neologd/Moses)+SentencePiece|8k (shared)|0.1938|0.2137|28.8645|31.2985|
|KyTea/Moses|80k/80k|0.1707|0.2006|23.2161|21.2513|
|MeCab/Moses|80k/80k|0.1668|0.1892|21.2033|21.2513|
|neologd/Moses|80k/80k|0.1589|0.1836|18.4768|21.2513|
|SentencePiece/Moses|8k/80k|0.1727|0.1994|22.9383|21.2513|
|KyTea/SentencePiece|80k/8k|0.1939|0.2141|23.2161|27.0429|
|MeCab/SentencePiece|80k/8k|0.1892|0.2077|21.2033|27.0429|
|neologd/SentencePiece|80k/8k|0.1641|0.1804|18.4768|27.0429|
|Char|3k (shared)|0.0824|0.0918|33.6963|109.8662|
#### Discussion
* **SentencePiece (Unigram/BPE)** outperforms word-based methods **(Moses/KyTea/MeCab/neologd)** even with a smaller vocabulary (10% of word-based methods).
* The number of tokens needed to represent Japanese sentences is almost comparable between **SentencePiece (unigram)** and **KyTea**, though the vocabulary of **SentencePiece** is much smaller. This implies that SentencePiece can effectively compress the sentences with a smaller vocabulary set.
* Pretokenization can slightly improve the BLEU scores in English to Japanese. In Japanese to English translation, pretokenization doesn't help to improve BLEU.
* **Neologd** shows a poor BLEU score. Tokenizing sentences with a large named entity dictionary might not be effective in neural-based text processing.
* **SentencePiece(Unigram)** shows a slightly better text compression ratio than **BPE**, but no significant difference in BLEU score.
* BLEU is sensitive to the choice of SentencePiece vocabulary size in English to Japanese. This is probably because the vocabulary size drastically affects the tokenization results in Japanese, which has no explicit spaces between words.
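The "10% of word-based methods" figure and the BLEU margin in the first bullet can be re-derived from the English to Japanese table. The short Python check below copies the test-set BLEU values for the 8k subword rows and the 80k/80k word-based rows; the dictionary keys are just labels chosen for this sketch.

```python
# Vocabulary sizes from the "vocab size" column (8k subword vs. 80k word-based).
vocab = {"SentencePiece": 8_000, "word-based": 80_000}

# BLEU(test) values copied from the English to Japanese table.
test_bleu = {
    "SentencePiece 8k": 0.2955,
    "SentencePiece(BPE) 8k": 0.2947,
    "Moses/KyTea 80k/80k": 0.2824,
    "Moses/MeCab 80k/80k": 0.2780,
    "Moses/neologd 80k/80k": 0.2378,
}

ratio = vocab["SentencePiece"] / vocab["word-based"]
best_word = max(v for k, v in test_bleu.items() if k.startswith("Moses/"))
worst_subword = min(v for k, v in test_bleu.items() if k.startswith("SentencePiece"))

print(f"vocabulary ratio: {ratio:.0%}")
print(f"smallest subword margin over word-based: {worst_subword - best_word:+.4f}")
```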
## Experiments 2 (subwording with various pre-tokenizations)
### Experimental settings
We have evaluated SentencePiece segmentation with the following configurations.
* Segmentation algorithms:
  * **BPE** (Byte Pair Encoding) [[Sennrich et al.](http://www.aclweb.org/anthology/P16-1162)] (`--model_type=bpe`)
  * **Unigram**. Language-model based segmentation. (`--model_type=unigram`)
* Pretokenization methods:
  * **NoPretok**: No pretokenization. We train SentencePiece directly from raw sentences (`--split_by_whitespace=false`).
  * **WsPretok**: Trains SentencePiece model from the sentences tokenized by whitespaces (`--split_by_whitespace=true`). When handling CJK, this setting is almost equivalent to **NoPretok**.
  * **MosesPretok**: Trains SentencePiece model from sentences tokenized by [Moses tokenizer](https://github.com/moses-smt/mosesdecoder/blob/master/scripts/tokenizer/tokenizer.perl). We used [KyTea](http://www.phontron.com/kytea/) for Japanese and in-house segmenters for Korean and Chinese respectively.
* NMT parameters: ([Google’s Neural Machine Translation System](https://arxiv.org/pdf/1609.08144.pdf) is applied for all experiments.)
  * 16k shared vocabulary (Shares the same vocabulary for source and target. We train single SentencePiece model by concatenating raw source and target sentences.)
  * Dropout prob: 0.2
  * num nodes: 512
  * num lstms: 8
* Evaluation metrics:
  * Case-sensitive BLEU on detokenized text with NIST scorer.
  * For CJK, the same word segmenters are applied prior to NIST scorer.
  * No detokenizer is applied for **NoPretok** and **WsPretok**, which can directly emit detokenized sentences.
  * Applied [Moses detokenizer](https://github.com/moses-smt/mosesdecoder/blob/master/scripts/tokenizer/detokenizer.perl) and in-house rule-based detokenizer (CJK) for **MosesPretok**.
* Data sets:
  * [KFTT](http://www.phontron.com/kftt/index.html)
  * [MultiUN](http://opus.lingfil.uu.se/MultiUN.php) (First 5M and next 5k/5k sentences are used for training and development/testing respectively.)
  * [WMT16](http://www.statmt.org/WMT16/)
  * In-house: (Used 5M parallel sentences for training)
**NoPretok** and **WsPretok** do not use any language-dependent resources.
**BPE**+**MosesPretok** is almost the same configuration used in [[Sennrich et al.](http://www.aclweb.org/anthology/P16-1162)] and [[Wu et al.](https://arxiv.org/pdf/1609.08144.pdf)].
### Results (BLEU scores)
|Language Pair|BPE(NoPretok)|BPE(WsPretok)|BPE(MosesPretok)|Unigram(NoPretok)|Unigram(WsPretok)|Unigram(MosesPretok)|
|---|---|---|---|---|---|---|
|KFTT en-ja| 0.2796| 0.281| 0.286| 0.2806| 0.280| 0.2871|
|KFTT ja-en| 0.1943| 0.208| 0.1967| 0.1985| 0.2148| 0.198|
|MultiUN ar-en| 0.5268| 0.5414| 0.5381| 0.5317| 0.5449| 0.5401|
|MultiUN en-ar| 0.4039| 0.4147| 0.4012| 0.4084| 0.4172| 0.3991|
|MultiUN en-zh| 0.4155| 0.4186| 0.395| 0.4214| 0.4165| 0.399|
|MultiUN zh-en| 0.46| 0.4716| 0.4806| 0.4644| 0.4711| 0.4759|
|In house en-ko| 0.178| 0.1851| 0.1893| 0.1846| 0.1872| 0.1890|
|In house ko-en| 0.1786| 0.1954| 0.1994| 0.1845| 0.1956| 0.2015|
|WMT16 cs-en| 0.1987| 0.2252| 0.2231| 0.2164| 0.2228| 0.2238|
|WMT16 de-en| 0.3194| 0.3348| 0.3374| 0.3261| 0.3375| 0.3398|
|WMT16 en-cs| 0.1607| 0.1827| 0.1812| 0.1722| 0.1778| 0.179|
|WMT16 en-de| 0.2847| 0.3029| 0.3013| 0.2946| 0.3000| 0.3053|
|WMT16 en-fi| 0.1434| 0.1528| 0.1499| 0.1472| 0.1568| 0.1517|
|WMT16 en-ru| 0.1884| 0.1973| 0.1989| 0.19| 0.1982| 0.1903|
|WMT16 fi-en| 0.1775| 0.1867| 0.1877| 0.182| 0.1882| 0.1865|
|WMT16 ru-en| 0.2042| 0.2229| 0.2194| 0.2087| 0.2201| 0.2155|
* **MosesPretok** does not always improve BLEU scores. Comparable accuracy can be obtained without using language-dependent resources in many language pairs.
* Whitespace pretokenization is a reasonable choice. It does not use language-specific resources.
* **NoPretok** shows poor BLEU scores. Unigrams are more robust than BPE when no pretokenizer is applied.
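The first and third observations can be checked directly against the table. The Python snippet below copies the three BPE columns and counts the language pairs where whitespace pretokenization scores higher than Moses pretokenization, and whether **NoPretok** is always below **WsPretok**. Only the BPE columns are included to keep the sketch short; the variable names are chosen for this example.

```python
# (pair, BPE NoPretok, BPE WsPretok, BPE MosesPretok), copied from the table above.
BPE_BLEU = [
    ("KFTT en-ja", 0.2796, 0.2810, 0.2860),
    ("KFTT ja-en", 0.1943, 0.2080, 0.1967),
    ("MultiUN ar-en", 0.5268, 0.5414, 0.5381),
    ("MultiUN en-ar", 0.4039, 0.4147, 0.4012),
    ("MultiUN en-zh", 0.4155, 0.4186, 0.3950),
    ("MultiUN zh-en", 0.4600, 0.4716, 0.4806),
    ("In house en-ko", 0.1780, 0.1851, 0.1893),
    ("In house ko-en", 0.1786, 0.1954, 0.1994),
    ("WMT16 cs-en", 0.1987, 0.2252, 0.2231),
    ("WMT16 de-en", 0.3194, 0.3348, 0.3374),
    ("WMT16 en-cs", 0.1607, 0.1827, 0.1812),
    ("WMT16 en-de", 0.2847, 0.3029, 0.3013),
    ("WMT16 en-fi", 0.1434, 0.1528, 0.1499),
    ("WMT16 en-ru", 0.1884, 0.1973, 0.1989),
    ("WMT16 fi-en", 0.1775, 0.1867, 0.1877),
    ("WMT16 ru-en", 0.2042, 0.2229, 0.2194),
]

# Pairs where whitespace pretokenization beats Moses pretokenization.
ws_wins = [pair for pair, _, ws, moses in BPE_BLEU if ws > moses]
# Whether skipping pretokenization is always worse than whitespace splitting.
nopretok_below_ws = all(no < ws for _, no, ws, _ in BPE_BLEU)

print(f"WsPretok > MosesPretok in {len(ws_wins)} of {len(BPE_BLEU)} pairs")
print("NoPretok always below WsPretok:", nopretok_below_ws)
```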
# Training options
The training options for `spm_train` can be listed using `spm_train --help`. Since the standard `pip install` of sentencepiece does not necessarily install `spm_train`, the options are also listed here.
```
--help (show help) type: bool default: false
--version (show version) type: bool default: false
--minloglevel (Messages logged at a lower level than this don't actually get logged anywhere) type: int default: 0
--input (comma separated list of input sentences) type: std::string default: ""
--input_format (Input format. Supported format is `text` or `tsv`.) type: std::string default: ""
--model_prefix (output model prefix) type: std::string default: ""
--model_type (model algorithm: unigram, bpe, word or char) type: std::string default: "unigram"
--vocab_size (vocabulary size) type: int32 default: 8000
--accept_language (comma-separated list of languages this model can accept) type: std::string default: ""
--self_test_sample_size (the size of self test samples) type: int32 default: 0
--character_coverage (character coverage to determine the minimum symbols) type: double default: 0.9995
--input_sentence_size (maximum size of sentences the trainer loads) type: int32 default: 0
--shuffle_input_sentence (Randomly sample input sentences in advance. Valid when --input_sentence_size > 0) type: bool default: true
--seed_sentencepiece_size (the size of seed sentencepieces) type: int32 default: 1000000
--shrinking_factor (Keeps top shrinking_factor pieces with respect to the loss) type: double default: 0.75
--num_threads (number of threads for training) type: int32 default: 16
--num_sub_iterations (number of EM sub-iterations) type: int32 default: 2
--max_sentencepiece_length (maximum length of sentence piece) type: int32 default: 16
--max_sentence_length (maximum length of sentence in byte) type: int32 default: 4192
--split_by_unicode_script (use Unicode script to split sentence pieces) type: bool default: true
--split_by_number (split tokens by numbers (0-9)) type: bool default: true
--split_by_whitespace (use a white space to split sentence pieces) type: bool default: true
--split_digits (split all digits (0-9) into separate pieces) type: bool default: false
--treat_whitespace_as_suffix (treat whitespace marker as suffix instead of prefix.) type: bool default: false
--control_symbols (comma separated list of control symbols) type: std::string default: ""
--user_defined_symbols (comma separated list of user defined symbols) type: std::string default: ""
--required_chars (UTF8 characters in this flag are always used in the character set regardless of --character_coverage) type: std::string default: ""
--byte_fallback (decompose unknown pieces into UTF-8 byte pieces) type: bool default: false
--vocabulary_output_piece_score (Define score in vocab file) type: bool default: true
--normalization_rule_name (Normalization rule name. Choose from nfkc or identity) type: std::string default: "nmt_nfkc"
--normalization_rule_tsv (Normalization rule TSV file. ) type: std::string default: ""
--denormalization_rule_tsv (Denormalization rule TSV file.) type: std::string default: ""
--add_dummy_prefix (Add dummy whitespace at the beginning of text) type: bool default: true
--remove_extra_whitespaces (Removes leading, trailing, and duplicate internal whitespace) type: bool default: true
--hard_vocab_limit (If set to false, --vocab_size is considered as a soft limit.) type: bool default: true
--use_all_vocab (If set to true, use all tokens as vocab. Valid for word/char models.) type: bool default: false
--unk_id (Override UNK (<unk>) id.) type: int32 default: 0
--bos_id (Override BOS (<s>) id. Set -1 to disable BOS.) type: int32 default: 1
--eos_id (Override EOS (</s>) id. Set -1 to disable EOS.) type: int32 default: 2
--pad_id (Override PAD (<pad>) id. Set -1 to disable PAD.) type: int32 default: -1
--unk_piece (Override UNK (<unk>) piece.) type: std::string default: "<unk>"
--bos_piece (Override BOS (<s>) piece.) type: std::string default: "<s>"
--eos_piece (Override EOS (</s>) piece.) type: std::string default: "</s>"
--pad_piece (Override PAD (<pad>) piece.) type: std::string default: "<pad>"
--unk_surface (Dummy surface string for <unk>. In decoding <unk> is decoded to `unk_surface`.) type: std::string default: " ⁇ "
--train_extremely_large_corpus (Increase bit depth for unigram tokenization.) type: bool default: false
```
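Each line of the listing follows the pattern `--name (description) type: T default: V`, so it can be turned into a lookup table when the defaults are needed programmatically. The Python sketch below parses such lines with a regular expression; the helper `parse_option` is hypothetical, and the pattern is inferred from the listing above rather than from any documented format guarantee.

```python
import re

# Greedy "(.*)" lets descriptions contain their own parentheses, e.g. "(0-9)".
OPTION_RE = re.compile(r"^--(\w+) \((.*)\) type: (\S+) default: (.*)$")

def parse_option(line):
    """Parse one help line into a dict, or return None if it does not match."""
    match = OPTION_RE.match(line.strip())
    if match is None:
        return None
    name, description, type_name, default = match.groups()
    return {"name": name, "description": description,
            "type": type_name, "default": default}

listing = [
    "--vocab_size (vocabulary size) type: int32 default: 8000",
    "--split_by_number (split tokens by numbers (0-9)) type: bool default: true",
    '--model_type (model algorithm: unigram, bpe, word or char) type: std::string default: "unigram"',
]
options = {opt["name"]: opt for opt in map(parse_option, listing) if opt}
print(options["vocab_size"]["default"], options["model_type"]["type"])
```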
# SentencePiece TensorFlow module
## WARNING
tf_sentencepiece is going to be deprecated in TensorFlow 2.3.0. tf_sentencepiece for TensorFlow 2.2.0x is the last release of tf_sentencepiece. Use [tensorflow_text](https://github.com/tensorflow/text) to run SentencePiece on TensorFlow.
Example
```Python
import tensorflow as tf
import tensorflow_text as text
model = open('test_model.model', 'rb').read()
s1 = text.SentencepieceTokenizer(model=model)
print(s1.tokenize(['hello world']))
print(s1.tokenize_with_offsets(['hello world']))
s2 = text.SentencepieceTokenizer(model=model, out_type=tf.dtypes.string)
print(s2.tokenize(['hello world']))
print(s2.tokenize_with_offsets(['hello world']))
```
## Introduction
The SentencePiece TensorFlow module implements the encode (text to id/piece) and decode (id/piece to text) operations, which are executed lazily on top of TensorFlow's Session mechanism. This module makes it possible to build an end-to-end training/inference computation graph by directly feeding raw sentences with `tf.placeholder`.
The SentencePiece model (model proto) is passed as an attribute of the TensorFlow operation
and embedded into the TensorFlow graph so the model and graph become purely self-contained.
## Build and Install SentencePiece
For Linux (x64) and macOS environments:
```
% pip install tf_sentencepiece
```
## Usage
Use pydoc to see the usage instructions:
```
% pydoc sentencepiece_processor_ops
```
[Sample code](https://colab.research.google.com/drive/1rQ0tgXmHv02sMO6VdTO0yYaTvc1Yv1yP)
sentencepiece-0.1.96/tensorflow/.gitignore 0000644 0001750 0000176 00000000042 14062671741 020152 0 ustar kenhys docker build/
sdist/
dist/
tmp/
*py[cod]
sentencepiece-0.1.96/config.h.in 0000644 0001750 0000176 00000000251 14062671741 016005 0 ustar kenhys docker #ifndef CONFIG_H_
#define CONFIG_H_
#define VERSION "@PROJECT_VERSION@"
#define PACKAGE "@PROJECT_NAME@"
#define PACKAGE_STRING "@PROJECT_NAME@"
#endif // CONFIG_H_
sentencepiece-0.1.96/appveyor.yml 0000644 0001750 0000176 00000001515 14062671741 016356 0 ustar kenhys docker version: '{branch} build {build}'
image: Visual Studio 2019
platform:
- x64
- Win32
configuration: Release
clone_depth: 50
clone_folder: c:\projects\sentencepiece
#init:
# - ps: iex ((new-object net.webclient).DownloadString('https://raw.githubusercontent.com/appveyor/ci/master/scripts/enable-rdp.ps1'))
#on_finish:
# - ps: $blockRdp = $true; iex ((new-object net.webclient).DownloadString('https://raw.githubusercontent.com/appveyor/ci/master/scripts/enable-rdp.ps1'))
build_script:
- cmd: call test.bat %platform%
artifacts:
- path: build\sentencepiece*.7z
- path: python\dist\*.whl
deploy:
description: 'SentencePiece Windows release'
provider: GitHub
auth_token:
secure: Aq4jHo/HY6WFFKs1h9cCWfi3U4ZsVTooUEhtgBfcJM6SUhnZdPVazIcKCtiR32kc
draft: false
prerelease: false
on:
branch: master
appveyor_repo_tag: true
# SentencePiece
SentencePiece is an unsupervised text tokenizer and detokenizer mainly for
Neural Network-based text generation systems where the vocabulary size
is predetermined prior to the neural model training. SentencePiece implements
**subword units** (e.g., **byte-pair-encoding (BPE)** [[Sennrich et al.](http://www.aclweb.org/anthology/P16-1162)] and
**unigram language model** [[Kudo.](https://arxiv.org/abs/1804.10959)])
with the extension of direct training from raw sentences. SentencePiece allows us to make a purely end-to-end system that does not depend on language-specific pre/postprocessing.
**This is not an official Google product.**
## Technical highlights
- **Purely data driven**: SentencePiece trains tokenization and detokenization
models from sentences. Pre-tokenization ([Moses tokenizer](https://github.com/moses-smt/mosesdecoder/blob/master/scripts/tokenizer/tokenizer.perl)/[MeCab](http://taku910.github.io/mecab/)/[KyTea](http://www.phontron.com/kytea/)) is not always required.
- **Language independent**: SentencePiece treats the sentences just as sequences of Unicode characters. There is no language-dependent logic.
- **Multiple subword algorithms**: **BPE** [[Sennrich et al.](http://www.aclweb.org/anthology/P16-1162)] and **unigram language model** [[Kudo.](https://arxiv.org/abs/1804.10959)] are supported.
- **Subword regularization**: SentencePiece implements subword sampling for [subword regularization](https://arxiv.org/abs/1804.10959) and [BPE-dropout](https://arxiv.org/abs/1910.13267) which help to improve the robustness and accuracy of NMT models.
- **Fast and lightweight**: Segmentation speed is around 50k sentences/sec, and memory footprint is around 6MB.
- **Self-contained**: The same tokenization/detokenization is obtained as long as the same model file is used.
- **Direct vocabulary id generation**: SentencePiece manages vocabulary to id mapping and can directly generate vocabulary id sequences from raw sentences.
- **NFKC-based normalization**: SentencePiece performs NFKC-based text normalization.
Readers unfamiliar with SentencePiece as software or as an algorithm can start with [a gentle introduction here](https://medium.com/@jacky2wong/understanding-sentencepiece-under-standing-sentence-piece-ac8da59f6b08).
## Comparisons with other implementations
|Feature|SentencePiece|[subword-nmt](https://github.com/rsennrich/subword-nmt)|[WordPiece](https://arxiv.org/pdf/1609.08144.pdf)|
|:---|:---:|:---:|:---:|
|Supported algorithm|BPE, unigram, char, word|BPE|BPE*|
|OSS?|Yes|Yes|Google internal|
|Subword regularization|[Yes](#subword-regularization)|No|No|
|Python Library (pip)|[Yes](python/README.md)|No|N/A|
|C++ Library|[Yes](doc/api.md)|No|N/A|
|Pre-segmentation required?|[No](#whitespace-is-treated-as-a-basic-symbol)|Yes|Yes|
|Customizable normalization (e.g., NFKC)|[Yes](doc/normalization.md)|No|N/A|
|Direct id generation|[Yes](#end-to-end-example)|No|N/A|
\* Note that the BPE algorithm used in WordPiece is slightly different from the original BPE.
## Overview
### What is SentencePiece?
SentencePiece is a re-implementation of **sub-word units**, an effective way to alleviate the open vocabulary
problems in neural machine translation. SentencePiece supports two segmentation algorithms, **byte-pair-encoding (BPE)** [[Sennrich et al.](http://www.aclweb.org/anthology/P16-1162)] and **unigram language model** [[Kudo.](https://arxiv.org/abs/1804.10959)]. Here are the high level differences from other implementations.
#### The number of unique tokens is predetermined
Neural Machine Translation models typically operate with a fixed
vocabulary. Unlike most unsupervised word segmentation algorithms, which
assume an infinite vocabulary, SentencePiece trains the segmentation model such
that the final vocabulary size is fixed, e.g., 8k, 16k, or 32k.
Note that SentencePiece specifies the final vocabulary size for training, which is different from
[subword-nmt](https://github.com/rsennrich/subword-nmt) that uses the number of merge operations.
The number of merge operations is a BPE-specific parameter and not applicable to other segmentation algorithms, including unigram, word and character.
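To make the difference between "number of merge operations" and "final vocabulary size" concrete, here is a toy BPE trainer in Python that stops when the vocabulary reaches a target size, so the number of merges is a result rather than an input. It is a minimal sketch for illustration only: it is not SentencePiece's trainer, the function `train_bpe` and the word list are made up, and ties between equally frequent pairs are broken arbitrarily but deterministically.

```python
from collections import Counter

def train_bpe(words, vocab_size):
    """Toy BPE: merge the most frequent adjacent pair until vocab_size is reached."""
    corpus = Counter(tuple(w) for w in words)   # word as a tuple of symbols -> count
    vocab = {ch for symbols in corpus for ch in symbols}
    merges = []
    while len(vocab) < vocab_size:
        pairs = Counter()
        for symbols, freq in corpus.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break  # every word is already a single symbol
        (a, b), _ = max(pairs.items(), key=lambda kv: (kv[1], kv[0]))
        merges.append((a, b))
        vocab.add(a + b)
        merged_corpus = Counter()
        for symbols, freq in corpus.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and symbols[i] == a and symbols[i + 1] == b:
                    out.append(a + b)   # replace the pair with the merged symbol
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            merged_corpus[tuple(out)] += freq
        corpus = merged_corpus
    return vocab, merges

words = ["low", "low", "lower", "newest", "newest", "widest"]
vocab, merges = train_bpe(words, vocab_size=13)
print(len(vocab), len(merges))  # the merge count follows from the target size
```

With 10 distinct characters in this word list, a target of 13 symbols implies three merges. A tool parameterized by merge operations would take that number as input instead, and the resulting vocabulary size would depend on the character set of the corpus.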
#### Trains from raw sentences
Previous sub-word implementations assume that the input sentences are pre-tokenized. This constraint was required for efficient training, but makes the preprocessing complicated as we have to run language dependent tokenizers in advance.
The implementation of SentencePiece is fast enough to train the model from raw sentences. This is useful for training the tokenizer and detokenizer for Chinese and Japanese where no explicit spaces exist between words.
#### Whitespace is treated as a basic symbol
The first step of natural language processing is text tokenization. For
example, a standard English tokenizer would segment the text "Hello World." into the
following three tokens.
> [Hello] [World] [.]
One observation is that the original input and the tokenized sequence are **NOT
reversibly convertible**. For instance, the information that there is no space between
"World" and "." is dropped from the tokenized sequence, since, e.g., `Tokenize("World.") == Tokenize("World .")`.
SentencePiece treats the input text just as a sequence of Unicode characters. Whitespace is also handled as a normal symbol. To handle the whitespace as a basic token explicitly, SentencePiece first escapes the whitespace with a meta symbol "▁" (U+2581) as follows.
> Hello▁World.
Then, this text is segmented into small pieces, for example:
> [Hello] [▁Wor] [ld] [.]
Since the whitespace is preserved in the segmented text, we can detokenize the text without any ambiguities.
```
detokenized = ''.join(pieces).replace('▁', ' ')
```
This feature makes it possible to perform detokenization without relying on language-specific resources.
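The contrast between the two approaches can be reproduced in a few lines of Python. The sketch below uses a simple regular-expression tokenizer as a stand-in for a "standard" word tokenizer and the piece sequence from the example above. The helper names are hypothetical, and SentencePiece's default dummy prefix (`--add_dummy_prefix`) is left out for brevity.

```python
import re

META = "\u2581"  # the meta symbol "▁" (U+2581)

def escape_whitespace(text):
    """Replace each space with the meta symbol, as in the example above."""
    return text.replace(" ", META)

def detokenize(pieces):
    """Lossless detokenization: concatenate and turn meta symbols back into spaces."""
    return "".join(pieces).replace(META, " ")

def naive_tokenize(text):
    """A word/punctuation tokenizer that discards whitespace, for contrast."""
    return re.findall(r"\w+|[^\w\s]", text)

pieces = ["Hello", META + "Wor", "ld", "."]
print("".join(pieces) == escape_whitespace("Hello World."))  # pieces cover the escaped text
print(detokenize(pieces))                                    # the original text is restored
print(naive_tokenize("World.") == naive_tokenize("World ."))  # the space information is lost
```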
Note that we cannot apply the same lossless conversions when splitting the
sentence with standard word segmenters, since they treat the whitespace as a
special symbol. Tokenized sequences do not preserve the necessary information to restore the original sentence.
* (en) Hello World. → [Hello] [World] [.] \(A space between Hello and World\)
* (ja) こんにちは世界。 → [こんにちは] [世界] [。] \(No space between こんにちは and 世界\)
#### Subword regularization and BPE-dropout
Subword regularization [[Kudo](https://arxiv.org/abs/1804.10959)] and BPE-dropout [[Provilkov et al.](https://arxiv.org/abs/1910.13267)] are simple regularization methods
that virtually augment training data with on-the-fly subword sampling, which helps to improve the accuracy as well as the robustness of NMT models.
To enable subword regularization, integrate the SentencePiece library
([C++](doc/api.md#sampling-subword-regularization)/[Python](python/README.md)) into the NMT system so that one segmentation is sampled for each parameter update, unlike the standard off-line data preparation. Here is an example using the [Python library](python/README.md). Note that 'New York' is segmented differently on each ``SampleEncode`` (C++) or ``encode`` with ``enable_sampling=True`` (Python) call. The sampling parameters are described in [sentencepiece_processor.h](src/sentencepiece_processor.h).
```
>>> import sentencepiece as spm
>>> s = spm.SentencePieceProcessor(model_file='spm.model')
>>> for n in range(5):
... s.encode('New York', out_type=str, enable_sampling=True, alpha=0.1, nbest_size=-1)
...
['▁', 'N', 'e', 'w', '▁York']
['▁', 'New', '▁York']
['▁', 'New', '▁Y', 'o', 'r', 'k']
['▁', 'New', '▁York']
['▁', 'New', '▁York']
```
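To give a feel for what sampling a segmentation means, here is a toy sketch in plain Python. It is not the library's algorithm: it enumerates every segmentation of a short string by brute force and draws one with probability proportional to `P(segmentation)^alpha`. The vocabulary and its log-probabilities in `LOGP` are invented for illustration, and the function names are hypothetical.

```python
import math
import random

# Hypothetical unigram log-probabilities; not taken from a real model.
LOGP = {
    "▁New": -2.0, "▁York": -2.5, "▁": -3.0, "New": -3.5, "▁Y": -4.0,
    "N": -6.0, "e": -5.0, "w": -6.0, "Y": -6.5, "o": -5.0, "r": -5.0, "k": -6.0,
}


def segmentations(text, vocab):
    """Yield every way to split `text` into pieces found in `vocab`."""
    if not text:
        yield []
        return
    for end in range(1, len(text) + 1):
        head = text[:end]
        if head in vocab:
            for tail in segmentations(text[end:], vocab):
                yield [head] + tail


def sample_segmentation(text, vocab, alpha, rng):
    """Draw one segmentation with weight exp(alpha * log P(segmentation))."""
    candidates = list(segmentations(text, vocab))
    scores = [sum(vocab[p] for p in seg) for seg in candidates]
    weights = [math.exp(alpha * s) for s in scores]
    return rng.choices(candidates, weights=weights, k=1)[0]


rng = random.Random(0)
for _ in range(5):
    print(sample_segmentation("▁New▁York", LOGP, alpha=0.1, rng=rng))
```

A small `alpha` flattens the distribution, so unlikely segmentations are drawn more often. A large `alpha` concentrates the samples on the most probable segmentation.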
## Installation
### Python module
SentencePiece provides a Python wrapper that supports both SentencePiece training and segmentation.
You can install the Python binary package of SentencePiece with:
```
% pip install sentencepiece
```
For more details, see [Python module](python/README.md).
### Build and install SentencePiece command line tools from C++ source
The following tools and libraries are required to build SentencePiece:
* [cmake](https://cmake.org/)
* C++11 compiler
* [gperftools](https://github.com/gperftools/gperftools) library (optional; can yield a 10-40% performance improvement)
On Ubuntu, the build tools can be installed with apt-get:
```
% sudo apt-get install cmake build-essential pkg-config libgoogle-perftools-dev
```
Then, you can build and install command line tools as follows.
```
% git clone https://github.com/google/sentencepiece.git
% cd sentencepiece
% mkdir build
% cd build
% cmake ..
% make -j $(nproc)
% sudo make install
% sudo ldconfig -v
```
On OSX/macOS, replace the last command with `sudo update_dyld_shared_cache`.
### Build and install using vcpkg
You can download and install sentencepiece using the [vcpkg](https://github.com/Microsoft/vcpkg) dependency manager:
    git clone https://github.com/Microsoft/vcpkg.git
    cd vcpkg
    ./bootstrap-vcpkg.sh
    ./vcpkg integrate install
    ./vcpkg install sentencepiece
The sentencepiece port in vcpkg is kept up to date by Microsoft team members and community contributors. If the version is out of date, please [create an issue or pull request](https://github.com/Microsoft/vcpkg) on the vcpkg repository.
## Usage instructions
### Train SentencePiece Model
```
% spm_train --input=<input> --model_prefix=<model_name> --vocab_size=8000 --character_coverage=1.0 --model_type=<type>
```
* `--input`: one-sentence-per-line **raw** corpus file. No need to run
tokenizer, normalizer or preprocessor. By default, SentencePiece normalizes
the input with Unicode NFKC. You can pass a comma-separated list of files.
* `--model_prefix`: output model name prefix. `<model_name>.model` and `<model_name>.vocab` are generated.
* `--vocab_size`: vocabulary size, e.g., 8000, 16000, or 32000
* `--character_coverage`: fraction of characters covered by the model. Good defaults are `0.9995` for languages with a rich character set like Japanese or Chinese and `1.0` for other languages with a small character set.
* `--model_type`: model type. Choose from `unigram` (default), `bpe`, `char`, or `word`. The input sentence must be pretokenized when using the `word` type.
Use `--help` flag to display all parameters for training, or see [here](doc/options.md) for an overview.
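To make `--character_coverage` more concrete, the following plain-Python sketch picks the most frequent characters of a corpus until they account for the requested fraction of all character occurrences. It only illustrates the idea. The function name `covered_characters` is hypothetical, and the exact selection and tie-breaking inside `spm_train` may differ.

```python
from collections import Counter


def covered_characters(sentences, coverage):
    """Return the most frequent characters whose counts reach `coverage`."""
    counts = Counter(ch for line in sentences for ch in line)
    total = sum(counts.values())
    kept, accumulated = set(), 0
    for ch, freq in counts.most_common():
        if accumulated >= coverage * total:
            break  # the requested fraction is already covered
        kept.add(ch)
        accumulated += freq
    return kept


corpus = ["aaaa", "aaab", "aaac"]  # toy corpus: 'a' x10, 'b' x1, 'c' x1
print(sorted(covered_characters(corpus, 0.8)))  # ['a']
print(sorted(covered_characters(corpus, 1.0)))  # ['a', 'b', 'c']
```

With a coverage below `1.0`, rare characters fall outside the kept set, which keeps the character inventory small for languages with very large character sets.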
### Encode raw text into sentence pieces/ids
```
% spm_encode --model=<model_file> --output_format=piece < input > output
% spm_encode --model=<model_file> --output_format=id < input > output
```
Use `--extra_options` flag to insert the BOS/EOS markers or reverse the input sequence.
```
% spm_encode --extra_options=eos (add </s> only)
% spm_encode --extra_options=bos:eos (add <s> and </s>)
% spm_encode --extra_options=reverse:bos:eos (reverse input and add <s> and </s>)
```
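The effect of these option strings on a piece sequence can be mimicked in plain Python. This is a sketch under the assumption that the colon-separated options are applied from left to right. `apply_extra_options` is a hypothetical helper, not part of the SentencePiece API.

```python
def apply_extra_options(pieces, extra_options, bos="<s>", eos="</s>"):
    """Apply colon-separated options (reverse, bos, eos) in the given order."""
    result = list(pieces)
    for option in extra_options.split(":"):
        if option == "reverse":
            result.reverse()
        elif option == "bos":
            result.insert(0, bos)
        elif option == "eos":
            result.append(eos)
        elif option:
            raise ValueError(f"unknown option: {option}")
    return result


pieces = ["▁I", "▁saw", "▁a", "▁girl"]
print(apply_extra_options(pieces, "eos"))
print(apply_extra_options(pieces, "reverse:bos:eos"))
```

Under this assumption, `reverse:bos:eos` reverses the pieces first, so the markers still end up at the start and end of the sequence.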
SentencePiece supports nbest segmentation and segmentation sampling with `--output_format=(nbest|sample)_(piece|id)` flags.
```
% spm_encode --model=<model_file> --output_format=sample_piece --nbest_size=-1 --alpha=0.5 < input > output
% spm_encode --model=<model_file> --output_format=nbest_id --nbest_size=10 < input > output
```
### Decode sentence pieces/ids into raw text
```
% spm_decode --model=<model_file> --input_format=piece < input > output
% spm_decode --model=<model_file> --input_format=id < input > output
```
Use `--extra_options` flag to decode the text in reverse order.
```
% spm_decode --extra_options=reverse < input > output
```
### End-to-End Example
```
% spm_train --input=data/botchan.txt --model_prefix=m --vocab_size=1000
unigram_model_trainer.cc(494) LOG(INFO) Starts training with :
input: "../data/botchan.txt"
...
unigram_model_trainer.cc(529) LOG(INFO) EM sub_iter=1 size=1100 obj=10.4973 num_tokens=37630 num_tokens/piece=34.2091
trainer_interface.cc(272) LOG(INFO) Saving model: m.model
trainer_interface.cc(281) LOG(INFO) Saving vocabs: m.vocab
% echo "I saw a girl with a telescope." | spm_encode --model=m.model
▁I ▁saw ▁a ▁girl ▁with ▁a ▁ te le s c o pe .
% echo "I saw a girl with a telescope." | spm_encode --model=m.model --output_format=id
9 459 11 939 44 11 4 142 82 8 28 21 132 6
% echo "9 459 11 939 44 11 4 142 82 8 28 21 132 6" | spm_decode --model=m.model --input_format=id
I saw a girl with a telescope.
```
You can see that the original input sentence is restored from the vocabulary id sequence.
### Export vocabulary list
```
% spm_export_vocab --model=<model_file> --output=<output file>
```
The output file stores a list of vocabulary entries and their emission log probabilities. The vocabulary id corresponds to the line number in this file.
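Reading such a file back into a lookup table could look like the sketch below. It assumes one tab-separated `piece<TAB>score` pair per line and zero-based line numbering (so that the first line gets id 0). The sample contents in `SAMPLE_VOCAB` are made up, and `load_vocab` is a hypothetical helper.

```python
import io

# Made-up file contents, for illustration only.
SAMPLE_VOCAB = "<unk>\t0\n<s>\t0\n</s>\t0\n▁a\t-2.5\n▁the\t-3.1\n"


def load_vocab(stream):
    """Map each piece to (id, score); the id is the zero-based line index."""
    vocab = {}
    for idx, line in enumerate(stream):
        piece, score = line.rstrip("\n").split("\t")
        vocab[piece] = (idx, float(score))
    return vocab


vocab = load_vocab(io.StringIO(SAMPLE_VOCAB))
print(vocab["▁a"])  # (3, -2.5)
```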
### Redefine special meta tokens
By default, SentencePiece uses Unknown (<unk>), BOS (<s>) and EOS (</s>) tokens which have the ids of 0, 1, and 2 respectively. We can redefine this mapping in the training phase as follows.
```
% spm_train --bos_id=0 --eos_id=1 --unk_id=5 --input=... --model_prefix=... --character_coverage=...
```
Setting an id to -1, e.g., ```bos_id=-1```, disables that special token. Note that the unknown id cannot be disabled. We can define an id for padding (<pad>) with ```--pad_id=3```.
If you want to assign other special tokens, please see [Use custom symbols](doc/special_symbols.md).
### Vocabulary restriction
```spm_encode``` accepts a ```--vocabulary``` and a ```--vocabulary_threshold``` option so that ```spm_encode``` will only produce symbols which also appear in the vocabulary (with at least some frequency). The background of this feature is described in [subword-nmt page](https://github.com/rsennrich/subword-nmt#best-practice-advice-for-byte-pair-encoding-in-nmt).
The usage is basically the same as that of ```subword-nmt```. Assuming that L1 and L2 are the two languages (source/target languages), train the shared spm model, and get resulting vocabulary for each:
```
% cat {train_file}.L1 {train_file}.L2 | shuffle > train
% spm_train --input=train --model_prefix=spm --vocab_size=8000 --character_coverage=0.9995
% spm_encode --model=spm.model --generate_vocabulary < {train_file}.L1 > {vocab_file}.L1
% spm_encode --model=spm.model --generate_vocabulary < {train_file}.L2 > {vocab_file}.L2
```
The ```shuffle``` command is used just in case, because ```spm_train``` loads the first 10M lines of the corpus by default.
Then segment the train/test corpus with the ```--vocabulary``` option:
```
% spm_encode --model=spm.model --vocabulary={vocab_file}.L1 --vocabulary_threshold=50 < {test_file}.L1 > {test_file}.seg.L1
% spm_encode --model=spm.model --vocabulary={vocab_file}.L2 --vocabulary_threshold=50 < {test_file}.L2 > {test_file}.seg.L2
```
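The thresholding step can be pictured with a small plain-Python sketch. It assumes the generated vocabulary file holds one tab-separated `piece<TAB>frequency` pair per line, which is an assumption about the file layout. The sample data is invented, and `load_allowed_pieces` is a hypothetical helper. How `spm_encode` re-segments text once a piece is disallowed is not shown here.

```python
import io

# Made-up vocabulary with frequencies, for illustration only.
SAMPLE = "▁the\t1200\n▁tele\t75\nscope\t12\n"


def load_allowed_pieces(stream, threshold):
    """Keep only the pieces whose frequency is at least `threshold`."""
    allowed = set()
    for line in stream:
        piece, freq = line.rstrip("\n").split("\t")
        if int(freq) >= threshold:
            allowed.add(piece)
    return allowed


allowed = load_allowed_pieces(io.StringIO(SAMPLE), threshold=50)
print(sorted(allowed))  # ['▁tele', '▁the']
```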
## Advanced topics
* [SentencePiece Experiments](doc/experiments.md)
* [SentencePieceProcessor C++ API](doc/api.md)
* [Use custom text normalization rules](doc/normalization.md)
* [Use custom symbols](doc/special_symbols.md)
* [Python Module](python/README.md)
* [TensorFlow Module](tensorflow/README.md)
* [Segmentation and training algorithms in detail]
sentencepiece-0.1.96/.travis.yml
language: cpp
matrix:
  include:
    - os: linux
      env: IMAGE=ubuntu:latest COMMAND=build_linux_gcc_coverall_ubuntu RELEASE_FILES="$TRAVIS_BUILD_DIR/build/*.xz"
      services: docker
    - os: linux
      env: IMAGE=ubuntu:focal COMMAND=build_linux_gcc_ubuntu
      services: docker
    - os: linux
      env: IMAGE=ubuntu:bionic COMMAND=build_linux_gcc_ubuntu
      services: docker
    - os: linux
      env: IMAGE=ubuntu:xenial COMMAND=build_linux_gcc_ubuntu
      services: docker
    - os: linux
      env: IMAGE=ubuntu:trusty COMMAND=build_linux_gcc_ubuntu
      services: docker
    - os: linux
      env: IMAGE=debian:stable COMMAND=build_linux_gcc_debian
      services: docker
    - os: linux
      env: IMAGE=fedora:latest COMMAND=build_linux_gcc_fedora
      services: docker
    - os: linux
      env: IMAGE=ubuntu:latest COMMAND=build_linux_clang_ubuntu
      services: docker
    - os: linux
      arch: arm64
      env: IMAGE=arm64v8/ubuntu:latest COMMAND=build_linux_gcc_ubuntu
      services: docker
    - os: linux
      arch: ppc64le
      env: IMAGE=ppc64le/ubuntu:latest COMMAND=build_linux_gcc_ubuntu
      services: docker
    - os: linux
      arch: s390x
      env: IMAGE=s390x/ubuntu:latest COMMAND=build_linux_gcc_ubuntu
      services: docker
    - os: linux
      env: IMAGE=x86_64 COMMAND=make_py_wheel_py RELEASE_FILES="$TRAVIS_BUILD_DIR/python/dist/*manylinux*.whl"
      script:
        - $TRAVIS_BUILD_DIR/python/make_py_wheel.sh ${IMAGE}
        - if [[ "$RELEASE_FILES" != "" ]]; then ls -l $RELEASE_FILES ; fi
      services: docker
    - os: linux
      env: IMAGE=i686 COMMAND=make_py_wheel_py RELEASE_FILES="$TRAVIS_BUILD_DIR/python/dist/*manylinux*.whl"
      script:
        - $TRAVIS_BUILD_DIR/python/make_py_wheel.sh ${IMAGE}
        - if [[ "$RELEASE_FILES" != "" ]]; then ls -l $RELEASE_FILES ; fi
      services: docker
    - os: linux
      arch: arm64
      env: IMAGE=aarch64 COMMAND=make_py_wheel_py RELEASE_FILES="$TRAVIS_BUILD_DIR/python/dist/*manylinux*.whl"
      script:
        - $TRAVIS_BUILD_DIR/python/make_py_wheel.sh ${IMAGE}
        - if [[ "$RELEASE_FILES" != "" ]]; then ls -l $RELEASE_FILES ; fi
      services: docker
    - os: linux
      arch: ppc64le
      env: IMAGE=ppc64le COMMAND=make_py_wheel_py RELEASE_FILES="$TRAVIS_BUILD_DIR/python/dist/*manylinux*.whl"
      script:
        - $TRAVIS_BUILD_DIR/python/make_py_wheel.sh ${IMAGE}
        - if [[ "$RELEASE_FILES" != "" ]]; then ls -l $RELEASE_FILES ; fi
      services: docker
    - os: linux
      arch: s390x
      env: IMAGE=s390x COMMAND=make_py_wheel_py RELEASE_FILES="$TRAVIS_BUILD_DIR/python/dist/*manylinux*.whl"
      script:
        - $TRAVIS_BUILD_DIR/python/make_py_wheel.sh ${IMAGE}
        - if [[ "$RELEASE_FILES" != "" ]]; then ls -l $RELEASE_FILES ; fi
      services: docker
    - os: osx
      osx_image: xcode9.4
      env: IMAGE=native COMMAND=build_osx
    - os: osx
      osx_image: xcode9.4
      env: IMAGE=native COMMAND=make_py_wheel_mac_py RELEASE_FILES="$TRAVIS_BUILD_DIR/python/dist/delocated_wheel/*.whl"
      script:
        - $TRAVIS_BUILD_DIR/python/make_py_wheel_mac.sh
        - if [[ "$RELEASE_FILES" != "" ]]; then ls -l $RELEASE_FILES ; fi
script:
  - $TRAVIS_BUILD_DIR/test.sh ${IMAGE} ${COMMAND}
  - if [[ "$RELEASE_FILES" != "" ]]; then ls -l $RELEASE_FILES ; fi
deploy:
provider: releases
skip_cleanup: true
api_key:
secure: WnrgfoRVSoi+E2YwFDgpQlxldfYQycN8DmMqbJab6uP0FWTmPptS9nmXWVGsXJS1u+sTsx/E+lM5xggl31u88hUJYsEUg+xPszSf+eiLfmdoEY+qYj2Vsuh7cT7P1tBScVMUiEQsoCcg9gZbHFHkSYJ74gyQxQhqJ52UmCJ1aNcp3nbtzgjBGvtsi2WBUdG1jSW0qwRj9gcq9eOWA4zkeHj9QKWhBtRD7fhpUiUDWVqaDSMu1E10QLNjkZ//qwbrWXb4MBzCa1ISla/ZoKv4TMQQrzYEwqxmbX2bxk1lMkJD3sKt3Wq/qNWDYaPKk9gz/cU9nAKwzSlJzus5c9pac6U/mh0IU8JhEGlkzFb1Ng3cHLdYT0hk0jAW15Ptcijqt+UGs0Arb1pdKvQV2e5bLEBrujCNGF8NFdsE23WDofEM/VKXuMNWW/j6b+VLESf05rz5p07IBMczLfW/Qs8mY5cqR9WaqPbYxMZlgwxtD+MiKERHlq1qVdK25M1UuB0wH/EbstVuEX2iNZRvffT9A+NglriLR74vNiCnfRlzGx4U4/Z79r2mwFrJTGupgq9N/jvKMs92qrT200VRtIto3JLEd3cnlM/9Gpv39SsYKA0seHKBpyFz/pGfXkOStv+14hzmEmXIFwG1QRTeFsZIUzmvvfMuhaG8Jjhdwpfvr68=
file_glob: true
file: "${RELEASE_FILES}"
on:
branch: master
tags: true
condition: $RELEASE_FILES != ""
env:
global:
secure: J52dK8uM1haWOP5Ktz01VETiYdpyOKtnGZXcZjxEXI7RV+44/MpkSSpKFrIex1jHDodn01Tv+/otmxotaz1HOPv4DgT2gg8FbHlpvnc6+B1/dEaeCDvnd33odmARoOszP0MNFTZdlvg6zGeJwPDYFfITn1jiFBtjazu19VIbQE4D1CSKkWsMXeyH1WjTb0LEtxhYwUcFgNqDb6trArx8xlvZNrh2/j5nPgAzvmuT0JuzwcRz9swwZftKcMjaK5JooSBTydtAzgVpVMZf1q+pF0nR9VlYIY34qQLsWirBjWHGRKdkgAEEN4vEMD1BKbhkIn7TjEpWLrH3BZuJY8uXAfnxvT8KXns2fhA1EDjlP/5n2y1jXAjqCZX8o1dC2fn6qxpL1Qg1WE0n9mhOZLMpbzCpJjBumjQPPUsviggRUs4awSYv3JrYuavvXQZ9rFM634O7CLIDVmbqssVyIYMhgIqLFAWgDxTyAxt+67vUy5ONsAenMOJ6bO36pYZHWH53isCRblUD5nq6Dj6WrW9P7lQhAdhvZ+Hyt+zyVCCblDY9lAv1KetU4i9sDSNYUkQtFTPVBw8LE4JmEctuM7iC6YqeneffPzzDLsGZ70m66VT1L4MYg5h2fGbtRuQ1nPz0+k2CNibN7NegaY35d7gUosnJJF04AeOUcea4+rgQkVM=
sentencepiece-0.1.96/src/builder_test.cc
// Copyright 2016 Google Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
#include "builder.h"
#include "common.h"
#include "filesystem.h"
#include "normalizer.h"
#include "sentencepiece_trainer.h"
#include "testharness.h"
#include "third_party/absl/strings/str_cat.h"
#include "util.h"
namespace sentencepiece {
namespace normalizer {
// Space symbol
#define WS "\xe2\x96\x81"
TEST(BuilderTest, RemoveRedundantMapTest) {
Builder::CharsMap chars_map;
// ab => AB, a => A, b => B, abc => BCA
chars_map[{0x0061}] = {0x0041};
chars_map[{0x0062}] = {0x0042};
chars_map[{0x0061, 0x0062}] = {0x0041, 0x0042};
chars_map[{0x0061, 0x0062, 0x0063}] = {0x0043, 0x0042, 0x0041};
EXPECT_TRUE(Builder::RemoveRedundantMap(&chars_map).ok());
EXPECT_EQ(3, chars_map.size());
EXPECT_EQ(chars_map.end(), chars_map.find({0x0061, 0x0062}));
EXPECT_NE(chars_map.end(), chars_map.find({0x0061}));
EXPECT_NE(chars_map.end(), chars_map.find({0x0062}));
EXPECT_NE(chars_map.end(), chars_map.find({0x0061, 0x0062, 0x0063}));
}
TEST(BuilderTest, GetPrecompiledCharsMapWithInvalidNameTest) {
std::string output;
EXPECT_FALSE(Builder::GetPrecompiledCharsMap("", &output).ok());
EXPECT_FALSE(Builder::GetPrecompiledCharsMap("__UNKNOWN__", &output).ok());
}
TEST(BuilderTest, BuildNFKCMapTest) {
Builder::CharsMap chars_map;
#ifdef ENABLE_NFKC_COMPILE
EXPECT_TRUE(Builder::BuildNFKCMap(&chars_map).ok());
EXPECT_TRUE(!chars_map.empty());
#else
EXPECT_TRUE(Builder::BuildNFKCMap(&chars_map).ok());
#endif
}
TEST(BuilderTest, GetPrecompiledCharsMapTest) {
{
const NormalizerSpec spec =
SentencePieceTrainer::GetNormalizerSpec("nmt_nfkc");
const Normalizer normalizer(spec);
EXPECT_EQ(WS "ABC", normalizer.Normalize("ABC"));
EXPECT_EQ(WS "(株)", normalizer.Normalize("㈱"));
EXPECT_EQ(WS "グーグル", normalizer.Normalize("グーグル"));
}
{
const NormalizerSpec spec =
SentencePieceTrainer::GetNormalizerSpec("nfkc_cf");
const Normalizer normalizer(spec);
EXPECT_EQ(WS "abc", normalizer.Normalize("ABC"));
EXPECT_EQ(WS "abc", normalizer.Normalize("ABC"));
}
{
const NormalizerSpec spec =
SentencePieceTrainer::GetNormalizerSpec("nmt_nfkc_cf");
const Normalizer normalizer(spec);
EXPECT_EQ(WS "abc", normalizer.Normalize("ABC"));
EXPECT_EQ(WS "abc", normalizer.Normalize("ABC"));
}
{
const NormalizerSpec spec =
SentencePieceTrainer::GetNormalizerSpec("identity");
EXPECT_TRUE(spec.precompiled_charsmap().empty());
const Normalizer normalizer(spec);
EXPECT_EQ(WS "ABC", normalizer.Normalize("ABC"));
EXPECT_EQ(WS "㈱", normalizer.Normalize("㈱"));
EXPECT_EQ(WS "グーグル", normalizer.Normalize("グーグル"));
}
}
TEST(BuilderTest, CompileCharsMap) {
Builder::CharsMap chars_map;
// Lowercase => Uppercase
for (char32 lc = static_cast<char32>('a'); lc <= static_cast<char32>('z');
++lc) {
const char32 uc = lc + 'A' - 'a';
chars_map[{lc}] = {uc};
}
// あいう => abc
chars_map[{0x3042, 0x3044, 0x3046}] = {0x0061, 0x0062, 0x0063};
// えお => remove
chars_map[{0x3048, 0x304A}] = {};
NormalizerSpec spec;
EXPECT_TRUE(
Builder::CompileCharsMap(chars_map, spec.mutable_precompiled_charsmap())
.ok());
Builder::CharsMap decompiled_chars_map;
EXPECT_TRUE(Builder::DecompileCharsMap(spec.precompiled_charsmap(),
&decompiled_chars_map)
.ok());
EXPECT_EQ(chars_map, decompiled_chars_map);
spec.set_add_dummy_prefix(false);
const Normalizer normalizer(spec);
EXPECT_EQ("ABC", normalizer.Normalize("abc"));
EXPECT_EQ("ABC", normalizer.Normalize("ABC"));
EXPECT_EQ("XY" WS "Z", normalizer.Normalize("xy z"));
EXPECT_EQ("あ", normalizer.Normalize("あ"));
EXPECT_EQ("abc", normalizer.Normalize("あいう"));
EXPECT_EQ("abcえ", normalizer.Normalize("あいうえ"));
EXPECT_EQ("ABCabcD", normalizer.Normalize("abcあいうd"));
EXPECT_EQ("abcか", normalizer.Normalize("あいうえおか"));
}
static constexpr char kTestInputData[] = "nfkc.tsv";
TEST(BuilderTest, LoadCharsMapTest) {
Builder::CharsMap chars_map;
ASSERT_TRUE(
Builder::LoadCharsMap(
util::JoinPath(absl::GetFlag(FLAGS_test_srcdir), kTestInputData),
&chars_map)
.ok());
std::string precompiled, expected;
ASSERT_TRUE(Builder::CompileCharsMap(chars_map, &precompiled).ok());
// Round-trip.
Builder::CharsMap decompiled_chars_map;
ASSERT_TRUE(
Builder::DecompileCharsMap(precompiled, &decompiled_chars_map).ok());
EXPECT_EQ(chars_map, decompiled_chars_map);
ASSERT_TRUE(
Builder::SaveCharsMap(
util::JoinPath(absl::GetFlag(FLAGS_test_tmpdir), "output.tsv"),
chars_map)
.ok());
Builder::CharsMap saved_chars_map;
ASSERT_TRUE(
Builder::LoadCharsMap(
util::JoinPath(absl::GetFlag(FLAGS_test_tmpdir), "output.tsv"),
&saved_chars_map)
.ok());
EXPECT_EQ(chars_map, saved_chars_map);
#ifdef ENABLE_NFKC_COMPILE
Builder::CharsMap nfkc_map;
ASSERT_TRUE(Builder::BuildNFKCMap(&nfkc_map).ok());
ASSERT_TRUE(Builder::CompileCharsMap(nfkc_map, &expected).ok());
#endif
}
TEST(BuilderTest, LoadCharsMapWithEmptyeTest) {
{
auto output = filesystem::NewWritableFile(
util::JoinPath(absl::GetFlag(FLAGS_test_tmpdir), "test.tsv"));
output->WriteLine("0061\t0041");
output->WriteLine("0062");
output->WriteLine("0063\t\t#foo=>bar");
}
Builder::CharsMap chars_map;
EXPECT_TRUE(Builder::LoadCharsMap(
util::JoinPath(absl::GetFlag(FLAGS_test_tmpdir), "test.tsv"),
&chars_map)
.ok());
EXPECT_EQ(3, chars_map.size());
EXPECT_EQ(std::vector<char32>({0x0041}), chars_map[{0x0061}]);
EXPECT_EQ(std::vector<char32>({}), chars_map[{0x0062}]);
EXPECT_EQ(std::vector<char32>({}), chars_map[{0x0063}]);
EXPECT_TRUE(
Builder::SaveCharsMap(
util::JoinPath(absl::GetFlag(FLAGS_test_tmpdir), "test_out.tsv"),
chars_map)
.ok());
Builder::CharsMap new_chars_map;
EXPECT_TRUE(
Builder::LoadCharsMap(
util::JoinPath(absl::GetFlag(FLAGS_test_tmpdir), "test_out.tsv"),
&new_chars_map)
.ok());
EXPECT_EQ(chars_map, new_chars_map);
}
TEST(BuilderTest, ContainsTooManySharedPrefixTest) {
Builder::CharsMap chars_map;
std::vector<char32> keys;
// chars_map contains too many shared prefix ("aaaa...");
for (int i = 0; i < 100; ++i) {
keys.push_back('a');
chars_map[keys] = {'b'};
}
std::string output;
EXPECT_FALSE(Builder::CompileCharsMap(chars_map, &output).ok());
}
} // namespace normalizer
} // namespace sentencepiece
sentencepiece-0.1.96/src/bpe_model_trainer.cc
// Copyright 2016 Google Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
#include
#include
#include
#include
#include "bpe_model_trainer.h"
#include "third_party/absl/container/flat_hash_set.h"
#include "util.h"
namespace sentencepiece {
namespace bpe {
std::string Trainer::Symbol::ToString() const {
return string_util::UnicodeTextToUTF8(chars);
}
Trainer::Symbol *Trainer::GetCharSymbol(char32 c) {
const uint64 freq = port::FindWithDefault(required_chars_, c, 1);
CHECK_GT(freq, 0);
const auto it = symbols_cache_.find(c);
if (it != symbols_cache_.end()) {
return it->second;
}
Symbol *s = new Symbol;
allocated_.push_back(s);
s->is_unk = (kUNKChar == c);
s->fp = c;
s->chars.push_back(c);
s->freq = freq;
port::InsertOrDie(&symbols_cache_, s->fp, s);
return s;
}
Trainer::Symbol *Trainer::GetPairSymbol(const Symbol *left,
const Symbol *right) {
if (left == nullptr || right == nullptr || left->is_unk || right->is_unk) {
return nullptr;
}
const uint64 fp = port::FingerprintCat(left->fp, right->fp);
const auto it = symbols_cache_.find(fp);
if (it != symbols_cache_.end()) {
return it->second;
}
CHECK(!left->chars.empty());
CHECK(!right->chars.empty());
string_util::UnicodeText ut;
for (const char32 c : left->chars) ut.push_back(c);
for (const char32 c : right->chars) ut.push_back(c);
// Do not make an invalid piece.
if (!IsValidSentencePiece(ut)) {
return nullptr;
}
Symbol *s = new Symbol;
allocated_.push_back(s);
s->fp = fp;
s->left = left;
s->right = right;
s->chars = ut;
port::InsertOrDie(&symbols_cache_, s->fp, s);
return s;
}
void Trainer::ComputeFreq(Symbol *symbol) const {
if (symbol->freq > 0) { // if freq == 0, re-computation is required.
return;
}
// Avoids double-count. ("AAA" => only count the first "AA").
Position prev_pos = {-1, 0};
CHECK_EQ(0, symbol->freq);
for (auto it = symbol->positions.begin(); it != symbol->positions.end();) {
const Position pos = DecodePos(*it);
// There are two same bigrams in "AAA", [AA] [AA], and we want to
// remove the second one to avoid double counts.
// If the right symbol in the first bigram and the left symbol in the
// second bigram have the same position, (pos.left == prev_pos.right),
// a duplicated bigram exists.
// Also, symbols_[sid][left] and symbols_[sid][right] must store
// the same symbols as symbol->left and symbol->right.
if ((pos.sid == prev_pos.sid && pos.left == prev_pos.right) ||
symbol->left != symbols_[pos.sid][pos.left] ||
symbol->right != symbols_[pos.sid][pos.right]) {
it = symbol->positions.erase(it);
// Initializes prev_pos.
// In "AAAA", the last "AA" can be counted.
prev_pos = {-1, 0};
} else {
symbol->freq += sentences_[pos.sid].second;
prev_pos = pos;
++it;
}
}
}
int Trainer::GetNextIndex(int sid, int index) const {
for (size_t i = index + 1; i < symbols_[sid].size(); ++i) {
if (symbols_[sid][i] == nullptr) continue;
return i;
}
return -1;
}
int Trainer::GetPrevIndex(int sid, int index) const {
for (int i = index - 1; i >= 0; --i) {
if (symbols_[sid][i] == nullptr) continue;
return i;
}
return -1;
}
void Trainer::AddNewPair(int sid, int left, int right) {
if (left == -1 || right == -1) return;
auto *symbol = GetPairSymbol(symbols_[sid][left], symbols_[sid][right]);
if (symbol != nullptr) {
active_symbols_.insert(symbol);
symbol->positions.insert(EncodePos(sid, left, right));
}
}
void Trainer::ResetFreq(int sid, int left, int right, const Symbol *best) {
if (left == -1 || right == -1) return;
auto *symbol = GetPairSymbol(symbols_[sid][left], symbols_[sid][right]);
if (symbol != nullptr && symbol != best) {
symbol->freq = 0;
}
}
void Trainer::UpdateActiveSymbols() {
std::vector<Symbol *> symbols;
for (auto &it : symbols_cache_) {
Symbol *symbol = it.second;
if (symbol->IsBigram()) {
ComputeFreq(symbol);
symbols.push_back(symbol);
}
}
// At least kMinActiveSymbolsSize symbols must be in |active_symbols_|.
constexpr int kMinActiveSymbolsSize = 1000;
// Keeps top 5% frequent symbols.
constexpr float kTopFrequentRatio = 0.05;
const int size =
std::min<int>(std::max<int>(kMinActiveSymbolsSize,
symbols_cache_.size() * kTopFrequentRatio),
symbols.size());
std::partial_sort(symbols.begin(), symbols.begin() + size, symbols.end(),
[](Symbol *s1, Symbol *s2) { return s1->freq > s2->freq; });
LOG(INFO) << "Updating active symbols. max_freq=" << symbols[0]->freq
<< " min_freq=" << symbols[size - 1]->freq;
active_symbols_.clear();
active_symbols_.insert(symbols.begin(), symbols.begin() + size);
}
util::Status Trainer::Train() {
RETURN_IF_ERROR(status());
CHECK_OR_RETURN(normalizer_spec_.escape_whitespaces());
CHECK_EQ_OR_RETURN(TrainerSpec::BPE, trainer_spec_.model_type());
symbols_.clear();
allocated_.clear();
symbols_cache_.clear();
active_symbols_.clear();
// Load all sentences
RETURN_IF_ERROR(LoadSentences());
if (trainer_spec_.split_by_whitespace()) {
SplitSentencesByWhitespace();
}
// Initializes symbols_. symbols_[sid][i] stores a unary symbol.
symbols_.resize(sentences_.size());
for (size_t i = 0; i < sentences_.size(); ++i) {
for (const char32 c : string_util::UTF8ToUnicodeText(sentences_[i].first)) {
symbols_[i].push_back(GetCharSymbol(c));
}
}
// Makes all bigram symbols.
for (size_t sid = 0; sid < symbols_.size(); ++sid) {
for (size_t i = 1; i < symbols_[sid].size(); ++i) {
AddNewPair(sid, i - 1, i);
}
}
const int vocab_size =
trainer_spec_.vocab_size() - meta_pieces_.size() - required_chars_.size();
CHECK_GE_OR_RETURN(vocab_size, 0);
// We may see duplicated pieces that are extracted with different path.
// In real segmentation phase, we can consider them as one symbol.
// e.g., "aaa" => "aa" + "a" or "a" + "aa".
absl::flat_hash_set<std::string> dup;
// Main loop.
CHECK_OR_RETURN(final_pieces_.empty());
while (final_pieces_.size() < static_cast<size_t>(vocab_size)) {
constexpr int kUpdateActiveSymbolsInteval = 100;
if (final_pieces_.size() % kUpdateActiveSymbolsInteval == 0) {
UpdateActiveSymbols();
}
// Scanning active symbols, finds the best_symbol with highest freq.
Symbol *best_symbol = nullptr;
for (auto &it : active_symbols_) {
Symbol *symbol = it;
ComputeFreq(symbol);
// If the frequency is the same, take shorter symbol.
// if the length is the same, use lexicographical comparison
if (best_symbol == nullptr ||
(symbol->freq > best_symbol->freq ||
(symbol->freq == best_symbol->freq &&
(symbol->chars.size() < best_symbol->chars.size() ||
(symbol->chars.size() == best_symbol->chars.size() &&
symbol->ToString() < best_symbol->ToString()))))) {
best_symbol = symbol;
}
}
if (best_symbol == nullptr) {
LOG(WARNING) << "No valid symbol found";
break;
}
if (!dup.insert(best_symbol->ToString()).second) {
// Removes best_symbol so it is not selected again.
symbols_cache_.erase(best_symbol->fp);
active_symbols_.erase(best_symbol);
continue;
}
// Stores the best_symbol in the final output.
final_pieces_.emplace_back(best_symbol->ToString(),
-static_cast<float>(final_pieces_.size()));
if (final_pieces_.size() % 20 == 0) {
LOG(INFO) << "Added: freq=" << best_symbol->freq
<< " size=" << final_pieces_.size()
<< " all=" << symbols_cache_.size()
<< " active=" << active_symbols_.size()
<< " piece=" << best_symbol->ToString();
}
// Add new bigrams which are created after symbol replacement.
// We do not need to scan all characters, but scan the neighbors in
// best_symbol.
for (const uint64 &encoded_pos : best_symbol->positions) {
const Position pos = DecodePos(encoded_pos);
if (symbols_[pos.sid][pos.left] == nullptr) {
// left index might be NULL (set in the previous iteration)
// when left_symbol == right_symbol.
continue;
}
CHECK_OR_RETURN(symbols_[pos.sid][pos.right]);
// We have three bigrams [prev, left], [left, right], [right, next],
// which are affected with this symbol replacement.
const int next = GetNextIndex(pos.sid, pos.right);
const int prev = GetPrevIndex(pos.sid, pos.left);
// Resets the frequencies of bigrams [prev, left] and [right, next].
ResetFreq(pos.sid, prev, pos.left, best_symbol);
ResetFreq(pos.sid, pos.right, next, best_symbol);
// Merges two symbols.
symbols_[pos.sid][pos.left] = best_symbol;
symbols_[pos.sid][pos.right] = nullptr;
// Makes new symbol bigrams [prev, left] and [left, next].
AddNewPair(pos.sid, prev, pos.left);
AddNewPair(pos.sid, pos.left, next);
}
// Removes best_symbol so it is not selected again.
symbols_cache_.erase(best_symbol->fp);
active_symbols_.erase(best_symbol);
} // end of main loop
// Adds required_chars_
for (const auto &w : Sorted(required_chars_)) {
const Symbol *symbol = GetCharSymbol(w.first);
final_pieces_.emplace_back(symbol->ToString(),
-static_cast<float>(final_pieces_.size()));
}
port::STLDeleteElements(&allocated_);
return Save();
}
} // namespace bpe
} // namespace sentencepiece
sentencepiece-0.1.96/src/testharness.h
// Copyright 2016 Google Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
#ifndef TESTHARNESS_H_
#define TESTHARNESS_H_
#include
#include
#include
#include
#include "common.h"
#include "third_party/absl/flags/flag.h"
#include "third_party/absl/flags/parse.h"
#include "third_party/absl/strings/string_view.h"
ABSL_DECLARE_FLAG(std::string, test_tmpdir);
ABSL_DECLARE_FLAG(std::string, test_srcdir);
namespace sentencepiece {
namespace test {
// Run some of the tests registered by the TEST() macro.
// TEST(Foo, Hello) { ... }
// TEST(Foo, World) { ... }
//
// Returns 0 if all tests pass.
// Dies or returns a non-zero value if some test fails.
int RunAllTests();
// An instance of Tester is allocated to hold temporary state during
// the execution of an assertion.
class Tester {
public:
Tester(const char *fname, int line) : ok_(true), fname_(fname), line_(line) {}
~Tester() {
if (!ok_) {
std::cerr << "[ NG ] " << fname_ << ":" << line_ << ":" << ss_.str()
<< std::endl;
exit(-1);
}
}
Tester &Is(bool b, const char *msg) {
if (!b) {
ss_ << " failed: " << msg;
ok_ = false;
}
return *this;
}
Tester &IsNear(double val1, double val2, double abs_error, const char *msg1,
const char *msg2) {
const double diff = std::fabs(val1 - val2);
if (diff > abs_error) {
ss_ << "The difference between (" << msg1 << ") and (" << msg2 << ") is "
<< diff << ", which exceeds " << abs_error << ", where\n"
<< msg1 << " evaluates to " << val1 << ",\n"
<< msg2 << " evaluates to " << val2;
ok_ = false;
}
return *this;
}
#define BINARY_OP(name, op) \
template <class X, class Y> \
Tester &name(const X &x, const Y &y, const char *msg1, const char *msg2) { \
if (!(x op y)) { \
ss_ << " failed: " << msg1 << (" " #op " ") << msg2; \
ok_ = false; \
} \
return *this; \
}
BINARY_OP(IsEq, ==)
BINARY_OP(IsNe, !=)
BINARY_OP(IsGe, >=)
BINARY_OP(IsGt, >)
BINARY_OP(IsLe, <=)
BINARY_OP(IsLt, <)
#undef BINARY_OP
// Attach the specified value to the error message if an error has occurred
template <class V>
Tester &operator<<(const V &value) {
if (!ok_) {
ss_ << " " << value;
}
return *this;
}
private:
bool ok_;
const char *fname_;
int line_;
std::stringstream ss_;
};
#define EXPECT_TRUE(c) \
sentencepiece::test::Tester(__FILE__, __LINE__).Is((c), #c)
#define EXPECT_FALSE(c) \
sentencepiece::test::Tester(__FILE__, __LINE__).Is((!(c)), #c)
#define EXPECT_STREQ(a, b) \
sentencepiece::test::Tester(__FILE__, __LINE__) \
.IsEq(std::string(a), std::string(b), #a, #b)
#define EXPECT_EQ(a, b) \
sentencepiece::test::Tester(__FILE__, __LINE__).IsEq((a), (b), #a, #b)
#define EXPECT_NE(a, b) \
sentencepiece::test::Tester(__FILE__, __LINE__).IsNe((a), (b), #a, #b)
#define EXPECT_GE(a, b) \
sentencepiece::test::Tester(__FILE__, __LINE__).IsGe((a), (b), #a, #b)
#define EXPECT_GT(a, b) \
sentencepiece::test::Tester(__FILE__, __LINE__).IsGt((a), (b), #a, #b)
#define EXPECT_LE(a, b) \
sentencepiece::test::Tester(__FILE__, __LINE__).IsLe((a), (b), #a, #b)
#define EXPECT_LT(a, b) \
sentencepiece::test::Tester(__FILE__, __LINE__).IsLt((a), (b), #a, #b)
#define EXPECT_NEAR(a, b, c) \
sentencepiece::test::Tester(__FILE__, __LINE__).IsNear((a), (b), (c), #a, #b)
#define EXPECT_OK(c) EXPECT_EQ(c, ::sentencepiece::util::OkStatus())
#define EXPECT_NOT_OK(c) EXPECT_NE(c, ::sentencepiece::util::OkStatus())
#define EXPECT_DEATH(statement, condition) \
{ \
sentencepiece::error::SetTestCounter(1); \
statement; \
sentencepiece::error::SetTestCounter(0); \
};
#define ASSERT_TRUE EXPECT_TRUE
#define ASSERT_FALSE EXPECT_FALSE
#define ASSERT_STREQ EXPECT_STREQ
#define ASSERT_EQ EXPECT_EQ
#define ASSERT_NE EXPECT_NE
#define ASSERT_GE EXPECT_GE
#define ASSERT_GT EXPECT_GT
#define ASSERT_LE EXPECT_LE
#define ASSERT_LT EXPECT_LT
#define ASSERT_NEAR EXPECT_NEAR
#define ASSERT_NOT_OK EXPECT_NOT_OK
#define ASSERT_DEATH EXPECT_DEATH
template <typename T>
class TestWithParam {
public:
using ParamType = T;
virtual void SetUp() {}
virtual void TearDown() {}
virtual ~TestWithParam() {}
virtual ParamType GetParam() const { return ParamType(); }
};
template <typename T>
std::vector<T> ValuesIn(const std::vector<T> &v) {
return v;
}
#define TCONCAT(a, b, c) TCONCAT1(a, b, c)
#define TCONCAT1(a, b, c) a##b##c
#define INSTANTIATE_TEST_SUITE_P(suite_base, base, params) \
std::vector<base::ParamType> TCONCAT(base, _get_params_, base)() { \
return params; \
}
#define TEST(base, name) \
class TCONCAT(base, _Test_, name) { \
public: \
void _Run(); \
static void _RunIt() { \
TCONCAT(base, _Test_, name) t; \
t._Run(); \
} \
}; \
bool TCONCAT(base, _Test_ignored_, name) = \
sentencepiece::test::RegisterTest(#base, #name, \
&TCONCAT(base, _Test_, name)::_RunIt); \
void TCONCAT(base, _Test_, name)::_Run()
#define TEST_P(base, name) \
std::vector<base::ParamType> TCONCAT(base, _get_params_, base)(); \
class TCONCAT(base, _Test_p_, name) : public base { \
public: \
const std::vector<ParamType> GetParams() const { \
return TCONCAT(base, _get_params_, base)(); \
} \
ParamType param_; \
void SetParam(const ParamType &param) { param_ = param; } \
const ParamType GetParam() { return param_; } \
void _Run(); \
static void _RunIt() { \
TCONCAT(base, _Test_p_, name) t; \
for (const auto &param : t.GetParams()) { \
t.SetParam(param); \
t.SetUp(); \
t._Run(); \
t.TearDown(); \
} \
} \
}; \
bool TCONCAT(base, _Test_p_ignored_, name) = \
sentencepiece::test::RegisterTest( \
#base, #name, &TCONCAT(base, _Test_p_, name)::_RunIt); \
void TCONCAT(base, _Test_p_, name)::_Run()
// Register the specified test. Typically not used directly, but
// invoked via the macro expansion of TEST.
extern bool RegisterTest(const char *base, const char *name, void (*func)());
} // namespace test
} // namespace sentencepiece
#endif // TESTHARNESS_H_
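// The TEST macro above depends on RegisterTest() being called while
// namespace-scope variables are initialized, but the definition of
// RegisterTest() is not part of this header. The sketch below shows one way
// such a self-registering scheme can work. The `sketch` namespace, TestCase,
// Registry(), and RunRegistered() are hypothetical names chosen for
// illustration; this is not the harness's actual implementation.

```cpp
#include <cassert>
#include <string>
#include <vector>

namespace sketch {

// One registered test case: suite name, test name, and the function to run.
struct TestCase {
  std::string base;
  std::string name;
  void (*func)();
};

// A function-local static avoids static-initialization-order problems,
// because registration happens before main() starts.
inline std::vector<TestCase> &Registry() {
  static std::vector<TestCase> tests;
  return tests;
}

// Hypothetical counterpart of RegisterTest(). It returns a bool so that the
// call can initialize a dummy namespace-scope variable, as TEST does.
inline bool RegisterTest(const char *base, const char *name, void (*func)()) {
  assert(func != nullptr);
  Registry().push_back({base, name, func});
  return true;
}

// Runs every registered test in registration order and returns how many ran.
inline int RunRegistered() {
  int num_run = 0;
  for (const TestCase &t : Registry()) {
    t.func();
    ++num_run;
  }
  return num_run;
}

}  // namespace sketch
```

// With this layout, a macro such as TEST only has to emit a function and one
// `bool ignored = sketch::RegisterTest(...)` definition per test case.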
sentencepiece-0.1.96/src/spm_encode_main.cc
// Copyright 2016 Google Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
#include <functional>
#include <string>
#include <vector>
#include "common.h"
#include "filesystem.h"
#include "init.h"
#include "sentencepiece.pb.h"
#include "sentencepiece_processor.h"
#include "third_party/absl/container/flat_hash_map.h"
#include "third_party/absl/flags/flag.h"
#include "third_party/absl/strings/str_cat.h"
#include "third_party/absl/strings/str_join.h"
#include "trainer_interface.h"
ABSL_FLAG(std::string, model, "", "model file name");
ABSL_FLAG(
std::string, output_format, "piece",
"choose from piece, id, proto, sample_piece, sample_id, sample_proto, "
"nbest_piece, nbest_id, or nbest_proto");
ABSL_FLAG(std::string, input, "", "input filename");
ABSL_FLAG(std::string, output, "", "output filename");
ABSL_FLAG(std::string, extra_options, "",
"':' separated encoder extra options, e.g., \"reverse:bos:eos\"");
ABSL_FLAG(int32, nbest_size, 10, "NBest size");
ABSL_FLAG(double, alpha, 0.5, "Smoothing parameter for sampling mode.");
ABSL_FLAG(uint32, random_seed, static_cast<uint32>(-1),
"Seed value for random generator.");
// Piece restriction with vocabulary file.
// https://github.com/rsennrich/subword-nmt#best-practice-advice-for-byte-pair-encoding-in-nmt
ABSL_FLAG(std::string, vocabulary, "",
"Restrict the vocabulary. The encoder only emits the "
"tokens in \"vocabulary\" file");
ABSL_FLAG(int32, vocabulary_threshold, 0,
"Words with frequency < threshold will be treated as OOV");
ABSL_FLAG(bool, generate_vocabulary, false,
"Generates vocabulary file instead of segmentation");
int main(int argc, char *argv[]) {
sentencepiece::ParseCommandLineFlags(argv[0], &argc, &argv, true);
std::vector<std::string> rest_args;
if (absl::GetFlag(FLAGS_input).empty()) {
for (int i = 1; i < argc; ++i) {
rest_args.push_back(std::string(argv[i]));
}
} else {
rest_args.push_back(absl::GetFlag(FLAGS_input));
}
if (absl::GetFlag(FLAGS_random_seed) != -1) {
sentencepiece::SetRandomGeneratorSeed(absl::GetFlag(FLAGS_random_seed));
}
if (rest_args.empty())
rest_args.push_back(""); // An empty filename means reading from stdin.
CHECK(!absl::GetFlag(FLAGS_model).empty());
sentencepiece::SentencePieceProcessor sp;
CHECK_OK(sp.Load(absl::GetFlag(FLAGS_model)));
CHECK_OK(sp.SetEncodeExtraOptions(absl::GetFlag(FLAGS_extra_options)));
if (!absl::GetFlag(FLAGS_vocabulary).empty()) {
CHECK_OK(sp.LoadVocabulary(absl::GetFlag(FLAGS_vocabulary),
absl::GetFlag(FLAGS_vocabulary_threshold)));
}
auto output =
sentencepiece::filesystem::NewWritableFile(absl::GetFlag(FLAGS_output));
CHECK_OK(output->status());
std::string line;
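std::vector<std::string> sps;
std::vector<int> ids;
std::vector<std::vector<std::string>> nbest_sps;
std::vector<std::vector<int>> nbest_ids;
absl::flat_hash_map<std::string, int> vocab;
sentencepiece::SentencePieceText spt;
sentencepiece::NBestSentencePieceText nbest_spt;
std::function<void(const std::string &line)> process;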
const int nbest_size = absl::GetFlag(FLAGS_nbest_size);
const float alpha = absl::GetFlag(FLAGS_alpha);
if (absl::GetFlag(FLAGS_generate_vocabulary)) {
process = [&](const std::string &line) {
CHECK_OK(sp.Encode(line, &spt));
for (const auto &piece : spt.pieces()) {
if (!sp.IsUnknown(piece.id()) && !sp.IsControl(piece.id()))
vocab[piece.piece()]++;
}
};
} else if (absl::GetFlag(FLAGS_output_format) == "piece") {
process = [&](const std::string &line) {
CHECK_OK(sp.Encode(line, &sps));
output->WriteLine(absl::StrJoin(sps, " "));
};
} else if (absl::GetFlag(FLAGS_output_format) == "id") {
process = [&](const std::string &line) {
CHECK_OK(sp.Encode(line, &ids));
output->WriteLine(absl::StrJoin(ids, " "));
};
} else if (absl::GetFlag(FLAGS_output_format) == "proto") {
process = [&](const std::string &line) { CHECK_OK(sp.Encode(line, &spt)); };
} else if (absl::GetFlag(FLAGS_output_format) == "sample_piece") {
process = [&](const std::string &line) {
CHECK_OK(sp.SampleEncode(line, nbest_size, alpha, &sps));
output->WriteLine(absl::StrJoin(sps, " "));
};
} else if (absl::GetFlag(FLAGS_output_format) == "sample_id") {
process = [&](const std::string &line) {
CHECK_OK(sp.SampleEncode(line, nbest_size, alpha, &ids));
output->WriteLine(absl::StrJoin(ids, " "));
};
} else if (absl::GetFlag(FLAGS_output_format) == "sample_proto") {
process = [&](const std::string &line) {
CHECK_OK(sp.SampleEncode(line, nbest_size, alpha, &spt));
};
} else if (absl::GetFlag(FLAGS_output_format) == "nbest_piece") {
process = [&](const std::string &line) {
CHECK_OK(sp.NBestEncode(line, nbest_size, &nbest_sps));
for (const auto &result : nbest_sps) {
output->WriteLine(absl::StrJoin(result, " "));
}
};
} else if (absl::GetFlag(FLAGS_output_format) == "nbest_id") {
process = [&](const std::string &line) {
CHECK_OK(sp.NBestEncode(line, nbest_size, &nbest_ids));
for (const auto &result : nbest_ids) {
output->WriteLine(absl::StrJoin(result, " "));
}
};
} else if (absl::GetFlag(FLAGS_output_format) == "nbest_proto") {
process = [&](const std::string &line) {
CHECK_OK(sp.NBestEncode(line, nbest_size, &nbest_spt));
};
} else {
LOG(FATAL) << "Unknown output format: "
<< absl::GetFlag(FLAGS_output_format);
}
for (const auto &filename : rest_args) {
auto input = sentencepiece::filesystem::NewReadableFile(filename);
CHECK_OK(input->status());
while (input->ReadLine(&line)) {
process(line);
}
}
if (absl::GetFlag(FLAGS_generate_vocabulary)) {
for (const auto &it : sentencepiece::Sorted(vocab)) {
output->WriteLine(it.first + "\t" +
sentencepiece::string_util::SimpleItoa(it.second));
}
}
return 0;
}
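// main() above chooses a per-line callback once, based on the flags, and then
// applies it to every input line. The sketch below isolates that pattern
// without depending on the SentencePiece library. FakeEncode(),
// MakeProcessor(), and the "vocab" mode name are hypothetical stand-ins for
// illustration; the real tool calls SentencePieceProcessor instead.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <sstream>
#include <string>
#include <vector>

namespace sketch {

// Stand-in for an encoder: splits the line on whitespace.
inline std::vector<std::string> FakeEncode(const std::string &line) {
  std::istringstream is(line);
  std::vector<std::string> pieces;
  for (std::string w; is >> w;) pieces.push_back(w);
  return pieces;
}

using LineProcessor = std::function<void(const std::string &)>;

// Picks the callback up front so the read loop does not re-check the mode
// for every line. `out` collects output lines and `vocab` collects piece
// counts; both must outlive the returned callback.
inline LineProcessor MakeProcessor(const std::string &format,
                                   std::vector<std::string> *out,
                                   std::map<std::string, int> *vocab) {
  assert(out != nullptr && vocab != nullptr);
  if (format == "piece") {
    return [out](const std::string &line) {
      std::string joined;
      for (const std::string &p : FakeEncode(line)) {
        if (!joined.empty()) joined += ' ';
        joined += p;
      }
      out->push_back(joined);
    };
  }
  if (format == "vocab") {
    return [vocab](const std::string &line) {
      for (const std::string &p : FakeEncode(line)) ++(*vocab)[p];
    };
  }
  return nullptr;  // Unknown format: the caller decides how to report it.
}

}  // namespace sketch
```

// Returning an empty std::function for an unknown format lets the caller fail
// before any input is read, which mirrors the LOG(FATAL) branch above.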
sentencepiece-0.1.96/src/word_model.cc
// Copyright 2016 Google Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
#include "util.h"
#include "word_model.h"
namespace sentencepiece {
namespace word {
Model::Model(const ModelProto &model_proto) {
model_proto_ = &model_proto;
InitializePieces();
}
Model::~Model() {}
EncodeResult Model::Encode(absl::string_view normalized) const {
if (!status().ok() || normalized.empty()) {
return {};
}
EncodeResult output;
for (const auto &w : SplitIntoWords(normalized)) {
output.emplace_back(w, PieceToId(w));
}
return output;
}
} // namespace word
} // namespace sentencepiece
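// word::Model::Encode() above maps each word to a vocabulary id and relies on
// helpers defined elsewhere (SplitIntoWords, PieceToId). The sketch below
// shows the same idea in a self-contained form. It assumes that words are
// delimited by the U+2581 marker and that unknown words map to a caller
// supplied id; the `sketch` namespace and its signatures are hypothetical and
// are not the library's implementation.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

namespace sketch {

// UTF-8 bytes of U+2581, assumed here to mark the start of each word.
const char kSpaceSymbol[] = "\xe2\x96\x81";

// Starts a new word at every marker and keeps the marker as a word prefix.
inline std::vector<std::string> SplitIntoWords(const std::string &text) {
  const std::string marker(kSpaceSymbol);
  std::vector<std::string> words;
  size_t pos = 0;
  while (pos < text.size()) {
    // Search after the first byte so a leading marker stays with its word.
    size_t next = text.find(marker, pos + 1);
    if (next == std::string::npos) next = text.size();
    words.push_back(text.substr(pos, next - pos));
    pos = next;
  }
  return words;
}

using EncodeResult = std::vector<std::pair<std::string, int>>;

// Looks each word up in `vocab`; words that are missing receive `unk_id`.
inline EncodeResult Encode(const std::string &normalized,
                           const std::unordered_map<std::string, int> &vocab,
                           int unk_id) {
  assert(unk_id >= 0);
  EncodeResult output;
  for (const std::string &w : SplitIntoWords(normalized)) {
    const auto it = vocab.find(w);
    output.emplace_back(w, it == vocab.end() ? unk_id : it->second);
  }
  return output;
}

}  // namespace sketch
```

// Unlike the subword models, this scheme never splits a word, so every
// out-of-vocabulary word becomes a single unknown piece.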
sentencepiece-0.1.96/src/unigram_model.cc
// Copyright 2016 Google Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
#include <algorithm>
#include <cfloat>
#include <cmath>
#include <complex>
#include <map>
#include <queue>
#include <string>
#include <utility>
#include <vector>
#include "third_party/absl/memory/memory.h"
#include "third_party/absl/strings/str_split.h"
#include "third_party/absl/strings/string_view.h"
#include "unigram_model.h"
#include "util.h"
namespace sentencepiece {
namespace unigram {
namespace {
// Size of nodes pre-allocated in Lattice.
constexpr size_t kPreallocateLatticeNodeSize = 1024;
constexpr float kUnkPenalty = 10.0;
constexpr float kEpsilon = 1e-7;
// Returns log(exp(x) + exp(y)).
// if init_mode is true, returns log(exp(y)) == y.
// log(\sum_i exp(a[i])) can be computed as
// for (int i = 0; i < a.size(); ++i)
// x = LogSumExp(x, a[i], i == 0);
inline float LogSumExp(float x, float y, bool init_mode) {
if (init_mode) {
return y;
}
const float vmin = std::min(x, y);
const float vmax = std::max(x, y);
constexpr float kMinusLogEpsilon = 50;
if (vmax > vmin + kMinusLogEpsilon) {
return vmax;
} else {
return vmax + log(std::exp(static_cast<double>(vmin - vmax)) + 1.0);
}
}
// Returns a sample from a standard Gumbel distribution.
// If U ~ U[0, 1], -log(-log U) ~ G(0,1)
inline float Gumbel() {
const float kEpsilon = 1e-7;
auto *mt = random::GetRandomGenerator();
std::uniform_real_distribution<float> dis(0.0, 1.0);
float noise = -std::log(-(std::log(dis(*mt) + kEpsilon)));
return noise;
}
} // namespace
Lattice::Lattice() : node_allocator_(kPreallocateLatticeNodeSize) {}
Lattice::~Lattice() {}
const std::vector<Lattice::Node *> &Lattice::begin_nodes(int pos) const {
return begin_nodes_[pos];
}
const std::vector<Lattice::Node *> &Lattice::end_nodes(int pos) const {
return end_nodes_[pos];
}
int Lattice::size() const {
// -1 because surface_ may include the EOS.
return std::max<int>(0, surface_.size() - 1);
}
int Lattice::utf8_size() const { return sentence_.size(); }
const char *Lattice::sentence() const { return sentence_.data(); }
const char *Lattice::surface(int pos) const { return surface_[pos]; }
Lattice::Node *Lattice::bos_node() const { return end_nodes_[0][0]; }
Lattice::Node *Lattice::eos_node() const { return begin_nodes_[size()][0]; }
Lattice::Node *Lattice::NewNode() {
Node *node = node_allocator_.Allocate();
node->node_id = node_allocator_.size() - 1;
return node;
}
void Lattice::Clear() {
begin_nodes_.clear();
end_nodes_.clear();
sentence_ = absl::string_view("");
surface_.clear();
node_allocator_.Free();
}
void Lattice::SetSentence(absl::string_view sentence) {
Clear();
sentence_ = sentence;
surface_.reserve(sentence.size() + 1);
while (!sentence.empty()) {
const int mblen = std::min<int>(string_util::OneCharLen(sentence.data()),
sentence.size());
surface_.push_back(sentence.data());
sentence.remove_prefix(mblen);
}
surface_.push_back(sentence.data());
const int len = size();
begin_nodes_.resize(len + 1);
end_nodes_.resize(len + 1);
constexpr size_t kReservedNodeSize = 16;
for (int i = 0; i <= len; ++i) {
begin_nodes_[i].reserve(kReservedNodeSize);
end_nodes_[i].reserve(kReservedNodeSize);
}
Node *bos = NewNode();
bos->id = -1;
bos->pos = 0;
end_nodes_[0].push_back(bos);
Node *eos = NewNode();
eos->id = -1;
eos->pos = len;
begin_nodes_[len].push_back(eos);
}
Lattice::Node *Lattice::Insert(int pos, int length) {
Node *node = NewNode();
node->pos = pos;
node->length = length;
const int utf8_length =
static_cast<int>(surface(pos + length) - surface(pos));
node->piece = absl::string_view(surface(pos), utf8_length);
begin_nodes_[pos].push_back(node);
end_nodes_[pos + node->length].push_back(node);
return node;
}
Lattice::LatticePathWithScore Lattice::Viterbi() {
const int len = size();
for (int pos = 0; pos <= len; ++pos) {
for (Node *rnode : begin_nodes_[pos]) {
rnode->prev = nullptr;
float best_score = 0.0;
Node *best_node = nullptr;
for (Node *lnode : end_nodes_[pos]) {
const float score = lnode->backtrace_score + rnode->score;
if (best_node == nullptr || score > best_score) {
best_node = lnode;
best_score = score;
}
}
if (best_node == nullptr) {
LOG(ERROR) << "Failed to find the best path in Viterbi.";
return {};
}
rnode->prev = best_node;
rnode->backtrace_score = best_score;
}
}
// backtrace
std::vector<Node *> results;
float score = begin_nodes(len)[0]->backtrace_score;
for (Node *node = begin_nodes_[len][0]->prev; node->prev != nullptr;
node = node->prev) {
results.push_back(node);
}
std::reverse(results.begin(), results.end());
LatticePathWithScore retval = {results, score};
return retval;
}
std::vector<float> Lattice::ForwardAlgorithm(float theta) const {
const int len = size();
std::vector<float> alpha(node_allocator_.size(), 0.0);
for (int pos = 0; pos <= len; ++pos) {
for (Node *rnode : begin_nodes_[pos]) {
for (Node *lnode : end_nodes_[pos]) {
alpha[rnode->node_id] = LogSumExp(
alpha[rnode->node_id], theta * lnode->score + alpha[lnode->node_id],
lnode == end_nodes_[pos][0]);
}
}
}
return alpha;
}
std::vector<float> Lattice::BackwardAlgorithm(float theta) const {
const int len = size();
std::vector<float> beta(node_allocator_.size(), 0.0);
for (int pos = len; pos >= 0; --pos) {
for (Node *lnode : end_nodes_[pos]) {
for (Node *rnode : begin_nodes_[pos]) {
beta[lnode->node_id] =
LogSumExp(beta[lnode->node_id], rnode->score + beta[rnode->node_id],
rnode == begin_nodes_[pos][0]);
}
}
}
return beta;
}
float Lattice::PopulateMarginal(float freq,
std::vector<float> *expected) const {
if (expected == nullptr) return 0.0;
const int len = size();
// alpha and beta (accumulative log prob) in Forward Backward.
// the index of alpha/beta is Node::node_id.
const auto alpha = ForwardAlgorithm(1.0);
const auto beta = BackwardAlgorithm(1.0);
const float Z = alpha[begin_nodes_[len][0]->node_id];
for (int pos = 0; pos < len; ++pos) {
for (Node *node : begin_nodes_[pos]) {
if (node->id >= 0) {
// the index of |expected| is a Node::id, which is a vocabulary id.
(*expected)[node->id] +=
freq *
std::exp(static_cast<double>(alpha[node->node_id] + node->score +
beta[node->node_id] - Z));
}
}
}
return freq * Z;
}
float Lattice::CalculateEntropy(float theta) const {
const int len = size();
// alpha[node_id] is the marginal prob of sequence up to start of node
// H is entropy of sequence
// the index of alpha/H is Node::node_id.
std::vector<float> alpha(node_allocator_.size(), 0.0);
std::vector<float> H(node_allocator_.size(), 0.0);
// Populate the forward marginals to get the normalising constant
alpha = ForwardAlgorithm(theta);
// Now populate the forward entropies
for (int pos = 0; pos <= len; ++pos) {
for (Node *rnode : begin_nodes_[pos]) {
for (Node *lnode : end_nodes_[pos]) {
// Contribution each lnode makes = p(lnode) * (H(lnode) + log p(lnode))
// We have to normalise p(lnode) by the marginal contribution it makes
const float lnode_transition_prob =
((theta * lnode->score) + alpha[lnode->node_id] -
alpha[rnode->node_id]);
H[rnode->node_id] += std::exp(lnode_transition_prob) *
(H[lnode->node_id] + lnode_transition_prob);
}
}
}
return -H[begin_nodes_[len][0]->node_id];
}
std::vector<Lattice::LatticePathWithScore> Lattice::NBest(size_t nbest_size,
bool sample,
float theta) {
if (nbest_size < 1) {
LOG(WARNING) << "nbest_size must be >= 1. Returning an empty result.";
return {};
}
if (nbest_size == 1 && !sample) {
return {Viterbi()};
}
// Uses A* search to enumerate N-bests.
// Given a lattice, enumerates hypotheses (paths) from EOS.
// At each partial path x, compute f(x) as follows
// f(x) = g(x) + h(x).
// g(x): the sum of scores from EOS to the left-most node in x.
// for a complete hypothesis, g(hyp) is the score of the hypothesis.
// h(x): a heuristic that estimates the largest score from x to BOS.
// f(x): the priority to pop a new hypothesis from the priority queue.
//
// As left-to-right Viterbi search can tell the *exact* value of h(x),
// we can obtain the exact n-best results with A*.
struct Hypothesis {
Node *node;
Hypothesis *next;
float fx;
float gx;
};
class HypothesisComparator {
public:
const bool operator()(Hypothesis *h1, Hypothesis *h2) {
return (h1->fx < h2->fx);
}
};
using Agenda = std::priority_queue<Hypothesis *, std::vector<Hypothesis *>,
HypothesisComparator>;
constexpr size_t kPreallocatedHypothesisSize = 512;
model::FreeList<Hypothesis> hypothesis_allocator(kPreallocatedHypothesisSize);
Agenda agenda;
std::vector<Lattice::LatticePathWithScore> results;
auto *eos = hypothesis_allocator.Allocate();
eos->node = eos_node();
eos->next = nullptr;
eos->gx = 0.0;
std::vector<float> alpha(node_allocator_.size(), 0.0);
if (sample) {
// Run forwards algorithm to get normalising constants
alpha = ForwardAlgorithm(theta);
// f(eos) = Gumbel(0), as it is the perturbed score of the entire lattice.
eos->fx = Gumbel();
} else {
// Run Viterbi first to fill backtrace score.
Viterbi();
eos->fx = eos->node->backtrace_score;
}
agenda.push(eos);
while (!agenda.empty()) {
auto *top = agenda.top();
agenda.pop();
auto *node = top->node;
// Reached BOS.
if (node == bos_node()) {
results.resize(results.size() + 1);
for (auto *n = top->next; n->next != nullptr; n = n->next) {
results.back().first.push_back(n->node);
}
results.back().second = top->fx;
if (results.size() == nbest_size) {
break;
}
continue;
}
const int end_nodes_size = end_nodes(node->pos).size();
std::vector<float> probs(end_nodes_size, 0.0);
std::vector<float> perturbed_probs(end_nodes_size, 0.0);
std::vector<float> adjusted_probs(end_nodes_size, 0.0);
const float Z = alpha[node->node_id];
if (sample) {
float max_score = -1e8;
// Calculate the marginal and perturbed scores for stochastic search
for (int i = 0; i < end_nodes(node->pos).size(); i++) {
Node *lnode = end_nodes(node->pos)[i];
// Calculate backwards transition score
probs[i] = top->gx + alpha[lnode->node_id] + (theta * lnode->score) - Z;
perturbed_probs[i] = probs[i] + Gumbel();
if (perturbed_probs[i] > max_score) {
max_score = perturbed_probs[i];
}
}
// Now constrain the sampled continuations to match the score of parent
for (int i = 0; i < adjusted_probs.size(); i++) {
// Use numerically stable version of truncated Gumbel:
// https://arxiv.org/pdf/1903.06059.pdf appendix B.3
const float v = top->fx - perturbed_probs[i] +
std::log1p(-std::exp(perturbed_probs[i] - max_score));
adjusted_probs[i] = top->fx - std::max(static_cast<float>(0.0), v) -
std::log1p(std::exp(-std::abs(v)));
}
}
// Expands new node ending at node->pos
for (int i = 0; i < end_nodes(node->pos).size(); i++) {
Node *lnode = end_nodes(node->pos)[i];
auto *hyp = hypothesis_allocator.Allocate();
hyp->node = lnode;
if (sample) {
hyp->gx = probs[i];
hyp->fx = adjusted_probs[i];
} else {
hyp->gx = lnode->score + top->gx; // just adds node->score
hyp->fx =
lnode->backtrace_score + top->gx; // backtrace_score is h(node).
}
hyp->next = top;
agenda.push(hyp);
}
// When the input is too long or contains duplicated phrases,
// `agenda` will get extremely big. Here we avoid this case by
// dynamically shrinking the agenda.
constexpr int kMaxAgendaSize = 100000;
constexpr int kMinAgendaSize = 512;
if (agenda.size() >= kMaxAgendaSize) {
LOG(WARNING) << "Agenda is too big; shrinking.";
// Keeps at most the top `kMinAgendaSize` hypotheses.
Agenda new_agenda;
const int size = std::min<int>(kMinAgendaSize, nbest_size * 10);
for (int i = 0; i < size; ++i) {
new_agenda.push(agenda.top());
agenda.pop();
}
agenda = std::move(new_agenda);
}
}
return results;
}
std::vector<Lattice::Node *> Lattice::Sample(float theta) {
const int len = size();
if (len == 0) return {};
std::vector<float> alpha(node_allocator_.size(), 0.0);
alpha = ForwardAlgorithm(theta);
auto *mt = random::GetRandomGenerator();
std::vector<Node *> results;
std::vector<float> probs;
float Z = alpha[eos_node()->node_id];
Node *node = eos_node();
while (true) {
probs.clear();
for (const Node *lnode : end_nodes_[node->pos]) {
probs.push_back(std::exp(static_cast<double>(alpha[lnode->node_id] +
theta * lnode->score - Z)));
}
std::discrete_distribution<int> dist(probs.begin(), probs.end());
node = end_nodes_[node->pos][dist(*mt)];
if (node == bos_node()) break;
Z = alpha[node->node_id];
results.push_back(node);
}
std::reverse(results.begin(), results.end());
return results;
}
// Model::Model() {}
// Model::~Model() {}
void Model::PopulateNodes(Lattice *lattice) const {
auto get_chars_length = [&lattice](int begin_pos, const char *end) {
int pos = begin_pos;
while (lattice->surface(pos) < end) ++pos;
return pos - begin_pos;
};
const float unk_score = min_score() - kUnkPenalty;
const int len = lattice->size();
const char *end = lattice->sentence() + lattice->utf8_size();
// +1 just in case.
std::vector<Darts::DoubleArray::result_pair_type> trie_results(
trie_results_size_ + 1);
for (int begin_pos = 0; begin_pos < len; ++begin_pos) {
const char *begin = lattice->surface(begin_pos);
// Finds all pieces which are prefix of surface(begin_pos).
const size_t num_nodes = trie_->commonPrefixSearch(
begin, trie_results.data(), trie_results.size(),
static_cast<int>(end - begin));
CHECK_LT(num_nodes, trie_results.size());
bool has_single_node = false;
// Inserts pieces to the lattice.
for (size_t k = 0; k < num_nodes; ++k) {
const int length =
get_chars_length(begin_pos, begin + trie_results[k].length);
const int id = trie_results[k].value;
if (IsUnusedInlined(id)) continue;
Lattice::Node *node = lattice->Insert(begin_pos, length);
node->id = id; // the value of Trie stores vocab_id.
// User defined symbol receives extra bonus to always be selected.
node->score = IsUserDefinedInlined(id) ? (length * max_score_ - 0.1)
: GetScoreInlined(id);
if (!has_single_node && node->length == 1) {
has_single_node = true;
}
}
if (!has_single_node) {
Lattice::Node *node = lattice->Insert(begin_pos, 1);
node->id = unk_id_; // add UNK node.
node->score = unk_score;
}
}
}
int Model::PieceToId(absl::string_view piece) const {
auto it = reserved_id_map_.find(piece);
if (it != reserved_id_map_.end()) {
return it->second;
}
int id = 0;
trie_->exactMatchSearch(piece.data(), id, piece.size());
return id == -1 ? unk_id_ : id;
}
void Model::BuildTrie(std::vector<std::pair<absl::string_view, int>> *pieces) {
if (!status().ok()) return;
if (pieces->empty()) {
status_ = util::InternalError("no pieces are loaded.");
return;
}
// sort by sentencepiece since DoubleArray::build()
// only accepts sorted strings.
sort(pieces->begin(), pieces->end());
// Makes key/value set for DoubleArrayTrie.
std::vector<const char *> key(pieces->size());
std::vector<int> value(pieces->size());
for (size_t i = 0; i < pieces->size(); ++i) {
key[i] = (*pieces)[i].first.data(); // sorted piece.
value[i] = (*pieces)[i].second; // vocab_id
}
trie_ = absl::make_unique<Darts::DoubleArray>();
if (trie_->build(key.size(), const_cast<char **>(&key[0]), nullptr,
&value[0]) != 0) {
status_ = util::InternalError("cannot build double-array.");
return;
}
// Computes the maximum number of shared prefixes in the trie.
const int kMaxTrieResultsSize = 1024;
std::vector<Darts::DoubleArray::result_pair_type> results(
kMaxTrieResultsSize);
trie_results_size_ = 0;
for (const auto &p : *pieces) {
const int num_nodes = trie_->commonPrefixSearch(
p.first.data(), results.data(), results.size(), p.first.size());
trie_results_size_ = std::max(trie_results_size_, num_nodes);
}
pieces_.clear();
if (trie_results_size_ == 0)
status_ = util::InternalError("no entry is found in the trie.");
}
Model::Model(const ModelProto &model_proto) {
model_proto_ = &model_proto;
InitializePieces();
min_score_ = FLT_MAX;
max_score_ = FLT_MIN;
for (const auto &sp : model_proto_->pieces()) {
if (sp.type() == ModelProto::SentencePiece::NORMAL) {
min_score_ = std::min(min_score_, sp.score());
max_score_ = std::max(max_score_, sp.score());
}
}
std::vector<std::pair<absl::string_view, int>> pieces;
for (const auto &it : pieces_) pieces.emplace_back(it.first, it.second);
BuildTrie(&pieces);
}
Model::~Model() {}
EncodeResult Model::Encode(absl::string_view normalized) const {
if (encoder_version_ == EncoderVersion::kOptimized) {
return EncodeOptimized(normalized);
}
if (!status().ok() || normalized.empty()) {
return {};
}
Lattice lattice;
lattice.SetSentence(normalized);
PopulateNodes(&lattice);
EncodeResult results;
for (const auto *node : lattice.Viterbi().first) {
results.emplace_back(node->piece, node->id);
}
return results;
}
NBestEncodeResult Model::NBestEncode(absl::string_view normalized,
int nbest_size) const {
if (!status().ok() || normalized.empty()) {
return {{{}, 0.0}};
}
nbest_size = std::max(1, std::min(nbest_size, 1024));
Lattice lattice;
lattice.SetSentence(normalized);
PopulateNodes(&lattice);
NBestEncodeResult nbest_results;
for (const auto &nbest : lattice.NBest(nbest_size, false, 0.0)) {
EncodeResult results;
for (const auto *node : nbest.first) {
results.emplace_back(node->piece, node->id);
}
nbest_results.emplace_back(results, nbest.second);
}
return nbest_results;
}
EncodeResult Model::SampleEncode(absl::string_view normalized,
float theta) const {
if (!status().ok() || normalized.empty()) {
return {};
}
Lattice lattice;
lattice.SetSentence(normalized);
PopulateNodes(&lattice);
EncodeResult results;
for (const auto *node : lattice.Sample(theta)) {
results.emplace_back(node->piece, node->id);
}
return results;
}
NBestEncodeResult Model::SampleEncodeAndScore(absl::string_view normalized,
float theta, int samples,
bool wor,
bool include_best) const {
if (!status().ok() || normalized.empty()) {
return {};
}
NBestEncodeResult results;
Lattice lattice;
lattice.SetSentence(normalized);
PopulateNodes(&lattice);
std::vector<float> alpha = lattice.ForwardAlgorithm(theta);
float marginal = alpha[lattice.eos_node()->node_id];
if (include_best) {
if (!wor) {
LOG(FATAL) << "include_best not supported for wor false";
}
EncodeResult result;
Lattice::LatticePathWithScore best_path = lattice.Viterbi();
for (const auto *node : best_path.first) {
result.emplace_back(node->piece, node->id);
}
// Inclusion probability if we always include the best is 1.
results.emplace_back(result, 0.0);
}
if (wor) {
// Draw k+1 samples as we need perturbed score of k+1th element
std::vector<Lattice::LatticePathWithScore> nbest_samples =
lattice.NBest(samples + 1, true, theta);
if (include_best) {
std::vector<std::vector<Lattice::Node *>> nbest_paths(
nbest_samples.size());
for (int i = 0; i < nbest_samples.size(); i++) {
nbest_paths[i] = nbest_samples[i].first;
}
// Remove the best result from the samples if necessary
Lattice::LatticePathWithScore best_path = lattice.Viterbi();
const int index_of_best =
(std::find(nbest_paths.begin(), nbest_paths.end(), best_path.first) -
nbest_paths.begin());
if (index_of_best != nbest_samples.size()) {
LOG(INFO) << "removing best path from samples";
nbest_samples.erase(nbest_samples.begin() + index_of_best);
} else {
nbest_samples.pop_back();
}
}
// We use the perturbed score of the k+1th element to calculate the
// inclusion probability.
const double kappa = static_cast<double>(nbest_samples.back().second);
// Discard the last sample
nbest_samples.pop_back();
for (const auto &nbest : nbest_samples) {
EncodeResult result;
float score = 0.0;
for (const auto *node : nbest.first) {
score += (theta * node->score);
result.emplace_back(node->piece, node->id);
}
results.emplace_back(result, score - marginal);
}
// Now calculate the inclusion probability
for (auto &it : results) {
// Only modify non best sample inclusion probabilities.
if (it.second != 0.0) {
double x = it.second - kappa;
double y = std::exp(x);
double inclusion_prob;
if (x <= -10) {
// Series expansion of the log Gumbel survival function up to eps.
inclusion_prob =
x - (y / 2) + (std::pow(y, 2) / 24) - std::pow(y, 4) / 2880;
} else {
inclusion_prob = std::log(-std::expm1(-y));
}
it.second = static_cast<float>(inclusion_prob);
}
}
} else {
while (results.size() < samples) {
Lattice lattice;
lattice.SetSentence(normalized);
PopulateNodes(&lattice);
float score = 0.0;
EncodeResult result;
std::vector<Lattice::Node *> sample = lattice.Sample(theta);
for (const auto *node : sample) {
result.emplace_back(node->piece, node->id);
score += (theta * node->score);
}
results.emplace_back(result, score - marginal);
}
}
return results;
}
float Model::CalculateEntropy(absl::string_view normalized, float theta) const {
Lattice lattice;
lattice.SetSentence(normalized);
PopulateNodes(&lattice);
return lattice.CalculateEntropy(theta);
}
bool Model::VerifyOutputsEquivalent(absl::string_view expected,
absl::string_view actual) const {
auto compute_unigram_model_score =
[this](std::vector<absl::string_view> output_pieces) {
float total_score = 0;
const float unk_score = min_score() - kUnkPenalty;
for (const auto p : output_pieces) {
const auto id = PieceToId(p);
if (id == unk_id_) {
total_score += unk_score;
} else {
const int length = p.size();
total_score += IsUserDefinedInlined(id)
? (length * max_score_ - 0.1)
: GetScoreInlined(id);
}
}
return total_score;
};
const auto expected_score =
compute_unigram_model_score(absl::StrSplit(expected, ' '));
const auto actual_score =
compute_unigram_model_score(absl::StrSplit(actual, ' '));
if (std::abs(expected_score - actual_score) > kEpsilon) {
LOG(WARNING) << "Two sentence piece sequences are not equivalent! Left: "
<< expected << ", Score: " << expected_score
<< ". Right: " << actual << ", Score: " << actual_score << ".";
return false;
}
return true;
}
EncodeResult Model::EncodeOptimized(absl::string_view normalized) const {
// An optimized Viterbi algorithm for unigram language models. Benchmarking
// results show that it generates almost identical outputs and achieves 2.1x
// speedup on average for 102 languages compared to the original
// implementation. It's based on the following three ideas:
//
// 1. Because it uses the *unigram* model:
// best_score(x1, x2, …, xt) = best_score(x1, x2, …, x{t-1}) + score(xt)
// Deciding the best path (and score) can be decoupled into two isolated
// terms: (a) the best path ended before the last token `best_score(x1, x2, …,
// x{t-1})`, and (b) the last token and its `score(xt)`. The two terms are
// not related to each other at all.
//
// Therefore, we can compute once and store the *best_path ending at
// each character position*. In this way, when we know best_path_ends_at[M],
// we can reuse it to compute all the best_path_ends_at[...] where the last
// token starts at the same character position M.
//
// This improves the time complexity from O(n*k*k) to O(n*k) because it
// eliminates the extra loop of recomputing the best path ending at the same
// position, where n is the input length and k is the maximum number of tokens
// that can be recognized starting at each position.
//
// 2. Again, because it uses the *unigram* model, we don’t need to actually
// store the lattice nodes. We still recognize all the tokens and lattice
// nodes from the input, but while identifying them, we use and discard them
// on the fly. There is no need to actually store them for best path Viterbi
// decoding. The only thing we need to store is the best_path ending at
// each character position.
//
// This improvement reduces the memory needed from O(n*k) to O(n), where n
// is the input length and k is the maximum number of tokens that can be
// recognized starting at each position.
//
// It also avoids the need for a dynamic-size lattice node pool, because the
// number of things to store is fixed as n.
//
// 3. SentencePiece is designed to work with unicode, taking utf-8 encoding
// inputs. In the original implementation, the lattice positions are based on
// unicode positions. A mapping from unicode position to the utf-8 position is
// maintained to recover the utf-8 string piece.
//
// We found that it is sufficient and beneficial to directly work with utf-8
// positions:
//
// Firstly, it saves the conversion and mapping between unicode positions and
// utf-8 positions.
//
// Secondly, it reduces the number of fields we need to maintain in the
// node/path structure. Specifically, there are 8 fields defined in
// `Lattice::Node` used by the original encoder, but here in the optimized
// encoder we only need to define 3 fields in `BestPathNode`.
if (!status().ok() || normalized.empty()) {
return {};
}
// Represents the last node of the best path.
struct BestPathNode {
int id = -1; // The vocab id. (may be -1 for UNK)
float best_path_score =
0; // The total score of the best path ending at this node.
int starts_at =
-1; // The starting position (in utf-8) of this node. The entire best
// path can be constructed by backtracking along this link.
};
const int size = normalized.size();
const float unk_score = min_score() - kUnkPenalty;
// The ends are exclusive.
std::vector<BestPathNode> best_path_ends_at(size + 1);
// Generate lattice on-the-fly (not stored) and update best_path_ends_at.
int starts_at = 0;
while (starts_at < size) {
std::size_t node_pos = 0;
std::size_t key_pos = starts_at;
const auto best_path_score_till_here =
best_path_ends_at[starts_at].best_path_score;
bool has_single_node = false;
const int mblen =
std::min(string_util::OneCharLen(normalized.data() + starts_at),
size - starts_at);
while (key_pos < size) {
const int ret =
trie_->traverse(normalized.data(), node_pos, key_pos, key_pos + 1);
if (ret == -2) break;
if (ret >= 0) {
if (IsUnusedInlined(ret)) continue;
// Update the best path node.
auto &target_node = best_path_ends_at[key_pos];
const auto length = (key_pos - starts_at);
// User defined symbol receives extra bonus to always be selected.
const auto score = IsUserDefinedInlined(ret)
? (length * max_score_ - 0.1)
: GetScoreInlined(ret);
const auto candidate_best_path_score =
score + best_path_score_till_here;
if (target_node.starts_at == -1 ||
candidate_best_path_score > target_node.best_path_score) {
target_node.best_path_score = candidate_best_path_score;
target_node.starts_at = starts_at;
target_node.id = ret;
}
if (!has_single_node && length == mblen) {
has_single_node = true;
}
}
}
if (!has_single_node) {
auto &target_node = best_path_ends_at[starts_at + mblen];
const auto candidate_best_path_score =
unk_score + best_path_score_till_here;
if (target_node.starts_at == -1 ||
candidate_best_path_score > target_node.best_path_score) {
target_node.best_path_score = candidate_best_path_score;
target_node.starts_at = starts_at;
target_node.id = unk_id_;
}
}
// Move by one unicode character.
starts_at += mblen;
}
// Backtrack to identify the best path.
EncodeResult results;
int ends_at = size;
while (ends_at > 0) {
const auto &node = best_path_ends_at[ends_at];
results.emplace_back(
normalized.substr(node.starts_at, ends_at - node.starts_at), node.id);
ends_at = node.starts_at;
}
std::reverse(results.begin(), results.end());
return results;
}
} // namespace unigram
} // namespace sentencepiece
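// Illustrative sketch (not part of the library): the "best path ending at each
// position" idea described in EncodeOptimized, reduced to a toy setting. The
// names ToyVocab and ToyEncode are hypothetical. The sketch assumes ASCII input
// (one byte per character) and looks pieces up in a hash map with substring
// probes instead of walking a trie, so it shows the O(n) storage and the
// backtracking links rather than the real lookup cost.

```cpp
#include <algorithm>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical toy vocabulary: piece -> log-probability score.
using ToyVocab = std::unordered_map<std::string, float>;

// Returns the highest-scoring segmentation of `text` under a unigram model.
// Assumes ASCII input, so every character is exactly one byte long.
std::vector<std::string> ToyEncode(const std::string &text,
                                   const ToyVocab &vocab, float unk_score) {
  // One entry per end position (exclusive); no lattice nodes are stored.
  struct BestPathNode {
    float best_path_score = 0;
    int starts_at = -1;  // Backtracking link to the start of the last piece.
  };
  const int size = static_cast<int>(text.size());
  std::vector<BestPathNode> best(size + 1);

  for (int start = 0; start < size; ++start) {
    const float score_till_here = best[start].best_path_score;
    bool has_single_char = false;
    for (int end = start + 1; end <= size; ++end) {
      const auto it = vocab.find(text.substr(start, end - start));
      if (it == vocab.end()) continue;
      if (end == start + 1) has_single_char = true;
      const float candidate = score_till_here + it->second;
      if (best[end].starts_at == -1 || candidate > best[end].best_path_score) {
        best[end].best_path_score = candidate;
        best[end].starts_at = start;
      }
    }
    // Unknown-character fallback keeps every position reachable.
    if (!has_single_char) {
      const float candidate = score_till_here + unk_score;
      BestPathNode &node = best[start + 1];
      if (node.starts_at == -1 || candidate > node.best_path_score) {
        node.best_path_score = candidate;
        node.starts_at = start;
      }
    }
  }

  // Backtrack from the end of the input along the starts_at links.
  std::vector<std::string> pieces;
  for (int end = size; end > 0; end = best[end].starts_at) {
    const int start = best[end].starts_at;
    pieces.push_back(text.substr(start, end - start));
  }
  std::reverse(pieces.begin(), pieces.end());
  return pieces;
}
```

// Because the score of a path is the sum of independent piece scores, the
// best path ending at `start` can be reused for every piece that begins there,
// which is why a single array of size n + 1 is enough in this sketch.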
// sentencepiece-0.1.96/src/common.h
// Copyright 2016 Google Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
#ifndef COMMON_H_
#define COMMON_H_
#include
#include
#include
#include
#include
#include
#include
#include
#include "config.h"
#if defined(_WIN32) && !defined(__CYGWIN__)
#define OS_WIN
#else
#define OS_UNIX
#endif
#ifdef OS_WIN
#ifndef NOMINMAX
#define NOMINMAX
#endif
#include
#endif
typedef int8_t int8;
typedef int16_t int16;
typedef int32_t int32;
typedef int64_t int64;
typedef uint8_t uint8;
typedef uint16_t uint16;
typedef uint32_t char32;
typedef uint32_t uint32;
typedef uint64_t uint64;
static constexpr uint8 kuint8max = ((uint8)0xFF);
static constexpr uint16 kuint16max = ((uint16)0xFFFF);
static constexpr uint32 kuint32max = ((uint32)0xFFFFFFFF);
static constexpr uint64 kuint64max = ((uint64)(0xFFFFFFFFFFFFFFFF));
static constexpr int8 kint8min = ((int8)~0x7F);
static constexpr int8 kint8max = ((int8)0x7F);
static constexpr int16 kint16min = ((int16)~0x7FFF);
static constexpr int16 kint16max = ((int16)0x7FFF);
static constexpr int32 kint32min = ((int32)~0x7FFFFFFF);
static constexpr int32 kint32max = ((int32)0x7FFFFFFF);
static constexpr int64 kint64min = ((int64)(~0x7FFFFFFFFFFFFFFF));
static constexpr int64 kint64max = ((int64)(0x7FFFFFFFFFFFFFFF));
static constexpr uint32 kUnicodeError = 0xFFFD;
#if defined(OS_WIN) && defined(UNICODE) && defined(_UNICODE)
#define WPATH(path) (::sentencepiece::win32::Utf8ToWide(path).c_str())
#else
#define WPATH(path) (path)
#endif
template <typename T, size_t N>
char (&ArraySizeHelper(T (&array)[N]))[N];
#ifndef _MSC_VER
template <typename T, size_t N>
char (&ArraySizeHelper(const T (&array)[N]))[N];
#endif // !_MSC_VER
#define arraysize(array) (sizeof(ArraySizeHelper(array)))
namespace sentencepiece {
#ifdef OS_WIN
namespace win32 {
std::wstring Utf8ToWide(const std::string &input);
std::string WideToUtf8(const std::wstring &input);
} // namespace win32
#endif
namespace error {
void Abort();
void Exit(int code);
void SetTestCounter(int c);
void ResetTestMode();
bool GetTestCounter();
class Die {
public:
explicit Die(bool die) : die_(die) {}
~Die() {
std::cerr << std::endl;
if (die_) {
Abort();
}
}
int operator&(std::ostream &) { return 0; }
private:
bool die_;
};
template <typename T>
T &&CheckNotNull(const char *file, int line, const char *exprtext, T &&t) {
if (t == nullptr) {
std::cerr << file << "(" << line << ") " << exprtext;
Abort();
}
return std::forward<T>(t);
}
} // namespace error
namespace logging {
enum LogSeverity {
LOG_INFO = 0,
LOG_WARNING = 1,
LOG_ERROR = 2,
LOG_FATAL = 3,
LOG_SEVERITY_SIZE = 4,
};
int GetMinLogLevel();
void SetMinLogLevel(int v);
inline const char *BaseName(const char *path) {
#ifdef OS_WIN
const char *p = strrchr(path, '\\');
#else
const char *p = strrchr(path, '/');
#endif
if (p == nullptr) return path;
return p + 1;
}
} // namespace logging
} // namespace sentencepiece
#define LOG(severity) \
(::sentencepiece::logging::GetMinLogLevel() > \
::sentencepiece::logging::LOG_##severity) \
? 0 \
: ::sentencepiece::error::Die( \
::sentencepiece::logging::LOG_##severity >= \
::sentencepiece::logging::LOG_FATAL) & \
std::cerr << ::sentencepiece::logging::BaseName(__FILE__) << "(" \
<< __LINE__ << ") " \
<< "LOG(" << #severity << ") "
#define CHECK(condition) \
(condition) ? 0 \
: ::sentencepiece::error::Die(true) & \
std::cerr << ::sentencepiece::logging::BaseName(__FILE__) \
<< "(" << __LINE__ << ") [" << #condition \
<< "] "
#define CHECK_STREQ(a, b) CHECK_EQ(std::string(a), std::string(b))
#define CHECK_EQ(a, b) CHECK((a) == (b))
#define CHECK_NE(a, b) CHECK((a) != (b))
#define CHECK_GE(a, b) CHECK((a) >= (b))
#define CHECK_LE(a, b) CHECK((a) <= (b))
#define CHECK_GT(a, b) CHECK((a) > (b))
#define CHECK_LT(a, b) CHECK((a) < (b))
#define CHECK_NOTNULL(val) \
::sentencepiece::error::CheckNotNull( \
::sentencepiece::logging::BaseName(__FILE__), __LINE__, \
"'" #val "' Must be non NULL", (val))
#define FRIEND_TEST(a, b) friend class a##_Test_##b;
#define CHECK_OK(expr) \
do { \
const auto _status = expr; \
CHECK(_status.ok()) << _status.ToString(); \
} while (0)
#define CHECK_NOT_OK(expr) \
do { \
const auto _status = expr; \
CHECK(!_status.ok()) << _status.ToString(); \
} while (0)
#define RETURN_IF_ERROR(expr) \
do { \
const auto _status = expr; \
if (!_status.ok()) return _status; \
} while (0)
#endif // COMMON_H_
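// Illustrative sketch (not part of the library): how the LOG/CHECK macros
// above combine a ternary, operator&, and a temporary's destructor. The names
// toy::ToyDie, toy::Sink, toy::Died, and TOY_CHECK are hypothetical. To keep
// the sketch testable it writes to a string stream and sets a flag where the
// real Die writes to std::cerr and calls Abort().

```cpp
#include <iostream>
#include <sstream>
#include <string>

namespace toy {

// Stand-ins for std::cerr and Abort() so the behavior can be observed.
inline std::ostringstream &Sink() {
  static std::ostringstream sink;
  return sink;
}
inline bool &Died() {
  static bool died = false;
  return died;
}

class ToyDie {
 public:
  explicit ToyDie(bool die) : die_(die) {}
  // The temporary is destroyed at the end of the full expression, i.e. after
  // every `<<` operand has been written to the stream.
  ~ToyDie() {
    Sink() << '\n';
    if (die_) Died() = true;
  }
  // `&` binds looser than `<<` but tighter than `?:`, so the whole stream
  // chain becomes the right operand and the result has type int like the `0`
  // in the other branch of the ternary.
  int operator&(std::ostream &) { return 0; }

 private:
  bool die_;
};

}  // namespace toy

#define TOY_CHECK(condition) \
  (condition) ? 0            \
              : ::toy::ToyDie(true) & ::toy::Sink() << "[" << #condition << "] "
```

// When the condition holds, the false branch is never evaluated, so the
// message operands cost nothing; when it fails, the message is streamed first
// and the destructor then terminates the line and reports the failure.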
// sentencepiece-0.1.96/src/sentencepiece.proto
// Copyright 2016 Google Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
syntax = "proto2";
// TODO(taku): Needs to use LITE RUNTIME in OSS release.
option optimize_for = LITE_RUNTIME;
package sentencepiece;
// SentencePieceText manages a user-facing source sentence,
// postprocessed target sentence, and internal segmentation
// with byte offsets.
message SentencePieceText {
message SentencePiece {
// Internal representation for the decoder.
// - Decoder can use |piece| as a basic token.
// - the piece must be non-empty.
// - A whitespace is replaced with a meta symbol.
// - Concatenation of pieces is not always the same as the |text|.
optional string piece = 1;
// Vocabulary id.
optional uint32 id = 2;
// External representation for the client.
// - It is always guaranteed that
// text.substr(begin, end - begin) == surface.
// - Concatenation of surface is always the same as the |text|.
// - |surface| may contain whitespaces.
// - |surface| may be empty if the piece encodes
// a control vocabulary. e.g., <s>, </s>, <unk>.
// - When |surface| is empty, always begin == end. (zero-length span).
optional string surface = 3;
optional uint32 begin = 4;
optional uint32 end = 5;
// Customized extensions: the range of field numbers
// are open to third-party extensions.
extensions 200 to max;
}
// User input or postprocessed text. This should be immutable
// since the byte range in SentencePiece is pointing to a span over this
// text. Meta symbols for whitespaces are not included.
optional string text = 1;
// A sequence of sentence pieces.
repeated SentencePiece pieces = 2;
// Score (usually log probability) for MultiSentencePieceText.
optional float score = 3;
// Customized extensions: the range of field numbers
// are open to third-party extensions.
extensions 200 to max;
}
message NBestSentencePieceText {
repeated SentencePieceText nbests = 1;
}
// sentencepiece-0.1.96/src/sentencepiece_trainer.cc
// Copyright 2018 Google Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
#include
#include
#include "builder.h"
#include "common.h"
#include "normalizer.h"
#include "sentencepiece.pb.h"
#include "sentencepiece_model.pb.h"
#include "sentencepiece_trainer.h"
#include "spec_parser.h"
#include "third_party/absl/flags/flag.h"
#include "third_party/absl/strings/numbers.h"
#include "third_party/absl/strings/str_cat.h"
#include "third_party/absl/strings/str_split.h"
#include "third_party/absl/strings/string_view.h"
#include "third_party/absl/strings/strip.h"
#include "trainer_factory.h"
#include "util.h"
namespace sentencepiece {
namespace {
static constexpr char kDefaultNormalizerName[] = "nmt_nfkc";
} // namespace
// static
util::Status SentencePieceTrainer::Train(const TrainerSpec &trainer_spec,
SentenceIterator *sentence_iterator,
std::string *serialized_model_proto) {
NormalizerSpec normalizer_spec;
return Train(trainer_spec, normalizer_spec, sentence_iterator,
serialized_model_proto);
}
util::Status SentencePieceTrainer::Train(const TrainerSpec &trainer_spec,
const NormalizerSpec &normalizer_spec,
SentenceIterator *sentence_iterator,
std::string *serialized_model_proto) {
NormalizerSpec denormalizer_spec;
return Train(trainer_spec, normalizer_spec, denormalizer_spec,
sentence_iterator, serialized_model_proto);
}
// static
util::Status SentencePieceTrainer::Train(
const TrainerSpec &trainer_spec, const NormalizerSpec &normalizer_spec,
const NormalizerSpec &denormalizer_spec,
SentenceIterator *sentence_iterator, std::string *serialized_model_proto) {
auto copied_normalizer_spec = normalizer_spec;
RETURN_IF_ERROR(PopulateNormalizerSpec(&copied_normalizer_spec, false));
auto copied_denormalizer_spec = denormalizer_spec;
RETURN_IF_ERROR(PopulateNormalizerSpec(&copied_denormalizer_spec, true));
auto trainer = TrainerFactory::Create(trainer_spec, copied_normalizer_spec,
copied_denormalizer_spec);
std::string info =
absl::StrCat(PrintProto(trainer_spec, "trainer_spec"),
PrintProto(copied_normalizer_spec, "normalizer_spec"));
if (!copied_denormalizer_spec.precompiled_charsmap().empty()) {
info += PrintProto(copied_denormalizer_spec, "denormalizer_spec");
} else {
info += "denormalizer_spec {}";
}
LOG(INFO) << "Starts training with : \n" << info;
if (serialized_model_proto) {
ModelProto model_proto;
RETURN_IF_ERROR(trainer->Train(sentence_iterator, &model_proto));
*serialized_model_proto = model_proto.SerializeAsString();
} else {
RETURN_IF_ERROR(trainer->Train(sentence_iterator, nullptr));
}
return util::OkStatus();
}
// static
NormalizerSpec SentencePieceTrainer::GetNormalizerSpec(absl::string_view name) {
NormalizerSpec spec;
spec.set_name(name.data(), name.size());
CHECK_OK(normalizer::Builder::GetPrecompiledCharsMap(
spec.name(), spec.mutable_precompiled_charsmap()));
return spec;
}
// static
util::Status SentencePieceTrainer::MergeSpecsFromArgs(
absl::string_view args, TrainerSpec *trainer_spec,
NormalizerSpec *normalizer_spec, NormalizerSpec *denormalizer_spec) {
CHECK_OR_RETURN(trainer_spec) << "`trainer_spec` must not be null.";
CHECK_OR_RETURN(normalizer_spec) << "`normalizer_spec` must not be null.";
CHECK_OR_RETURN(denormalizer_spec) << "`denormalizer_spec` must not be null.";
if (args.empty()) return util::OkStatus();
std::unordered_map<std::string, std::string> kwargs;
for (auto arg : absl::StrSplit(args, " ")) {
absl::ConsumePrefix(&arg, "--");
std::string key, value;
const auto pos = arg.find('=');
if (pos == absl::string_view::npos) {
key = std::string(arg);
} else {
key = std::string(arg.substr(0, pos));
value = std::string(arg.substr(pos + 1));
}
kwargs.emplace(key, value);
}
return MergeSpecsFromArgs(kwargs, trainer_spec, normalizer_spec,
denormalizer_spec);
}
// static
util::Status SentencePieceTrainer::MergeSpecsFromArgs(
const std::unordered_map<std::string, std::string> &kwargs,
TrainerSpec *trainer_spec, NormalizerSpec *normalizer_spec,
NormalizerSpec *denormalizer_spec) {
CHECK_OR_RETURN(trainer_spec) << "`trainer_spec` must not be null.";
CHECK_OR_RETURN(normalizer_spec) << "`normalizer_spec` must not be null.";
CHECK_OR_RETURN(denormalizer_spec) << "`denormalizer_spec` must not be null.";
for (const auto &it : kwargs) {
const auto &key = it.first;
const auto &value = it.second;
// Exceptions.
if (key == "normalization_rule_name") {
normalizer_spec->set_name(value);
continue;
} else if (key == "denormalization_rule_tsv") {
denormalizer_spec->set_normalization_rule_tsv(value);
denormalizer_spec->set_add_dummy_prefix(false);
denormalizer_spec->set_remove_extra_whitespaces(false);
denormalizer_spec->set_escape_whitespaces(false);
continue;
} else if (key == "minloglevel") {
int v = 0;
CHECK_OR_RETURN(absl::SimpleAtoi(value, &v));
logging::SetMinLogLevel(v);
continue;
}
const auto status_train = SetProtoField(key, value, trainer_spec);
if (status_train.ok()) continue;
if (!util::IsNotFound(status_train)) return status_train;
const auto status_norm = SetProtoField(key, value, normalizer_spec);
if (status_norm.ok()) continue;
if (!util::IsNotFound(status_norm)) return status_norm;
// Found in neither trainer_spec nor normalizer_spec.
if (util::IsNotFound(status_train) && util::IsNotFound(status_norm)) {
return status_train;
}
}
return util::OkStatus();
}
// static
util::Status SentencePieceTrainer::Train(absl::string_view args,
SentenceIterator *sentence_iterator,
std::string *serialized_model_proto) {
LOG(INFO) << "Running command: " << args.data();
TrainerSpec trainer_spec;
NormalizerSpec normalizer_spec;
NormalizerSpec denormalizer_spec;
RETURN_IF_ERROR(MergeSpecsFromArgs(args, &trainer_spec, &normalizer_spec,
&denormalizer_spec));
return Train(trainer_spec, normalizer_spec, denormalizer_spec,
sentence_iterator, serialized_model_proto);
}
// static
util::Status SentencePieceTrainer::Train(
const std::unordered_map<std::string, std::string> &kwargs,
SentenceIterator *sentence_iterator, std::string *serialized_model_proto) {
TrainerSpec trainer_spec;
NormalizerSpec normalizer_spec;
NormalizerSpec denormalizer_spec;
RETURN_IF_ERROR(MergeSpecsFromArgs(kwargs, &trainer_spec, &normalizer_spec,
&denormalizer_spec));
return Train(trainer_spec, normalizer_spec, denormalizer_spec,
sentence_iterator, serialized_model_proto);
}
// static
util::Status SentencePieceTrainer::PopulateNormalizerSpec(
NormalizerSpec *normalizer_spec, bool is_denormalizer) {
CHECK_OR_RETURN(normalizer_spec);
if (!normalizer_spec->normalization_rule_tsv().empty()) {
CHECK_OR_RETURN(normalizer_spec->precompiled_charsmap().empty())
<< "precompiled_charsmap is already defined.";
normalizer::Builder::CharsMap chars_map;
RETURN_IF_ERROR(normalizer::Builder::LoadCharsMap(
normalizer_spec->normalization_rule_tsv(), &chars_map));
RETURN_IF_ERROR(normalizer::Builder::CompileCharsMap(
chars_map, normalizer_spec->mutable_precompiled_charsmap()));
normalizer_spec->set_name("user_defined");
} else if (!is_denormalizer) {
if (normalizer_spec->name().empty()) {
normalizer_spec->set_name(kDefaultNormalizerName);
}
if (normalizer_spec->precompiled_charsmap().empty()) {
RETURN_IF_ERROR(normalizer::Builder::GetPrecompiledCharsMap(
normalizer_spec->name(),
normalizer_spec->mutable_precompiled_charsmap()));
}
}
return util::OkStatus();
}
// static
util::Status SentencePieceTrainer::PopulateModelTypeFromString(
absl::string_view type, TrainerSpec *spec) {
static const std::unordered_map<std::string, TrainerSpec::ModelType>
kModelTypeMap = {{"unigram", TrainerSpec::UNIGRAM},
{"bpe", TrainerSpec::BPE},
{"word", TrainerSpec::WORD},
{"char", TrainerSpec::CHAR}};
const auto it = kModelTypeMap.find(absl::AsciiStrToLower(type));
if (it != kModelTypeMap.end()) {
spec->set_model_type(it->second);
return util::OkStatus();
}
return util::StatusBuilder(util::StatusCode::kInternal, GTL_LOC)
<< "\"" << type << "\" is not found in TrainerSpec";
}
namespace {
const pretokenizer::PretokenizerForTrainingInterface *g_pretokenizer = nullptr;
} // namespace
// static
util::Status SentencePieceTrainer::SetPretokenizerForTraining(
const pretokenizer::PretokenizerForTrainingInterface *pretokenizer) {
g_pretokenizer = pretokenizer;
return util::OkStatus();
}
// static
const pretokenizer::PretokenizerForTrainingInterface *
SentencePieceTrainer::GetPretokenizerForTraining() {
return g_pretokenizer;
}
} // namespace sentencepiece
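// Illustrative sketch (not part of the library): the "--key=value" splitting
// performed by the string overload of MergeSpecsFromArgs, written with the
// standard library only. ToyParseArgs is a hypothetical helper. One deliberate
// difference: it splits on any run of whitespace, whereas the code above
// splits on single spaces with absl::StrSplit.

```cpp
#include <sstream>
#include <string>
#include <unordered_map>

// Turns "--input=a.txt --vocab_size=8000 --flag" into a key/value map.
// A token without '=' maps to an empty value.
std::unordered_map<std::string, std::string> ToyParseArgs(
    const std::string &args) {
  std::unordered_map<std::string, std::string> kwargs;
  std::istringstream stream(args);
  std::string arg;
  while (stream >> arg) {
    // Drop the leading "--" if present.
    if (arg.compare(0, 2, "--") == 0) arg.erase(0, 2);
    const auto pos = arg.find('=');
    if (pos == std::string::npos) {
      kwargs.emplace(arg, "");
    } else {
      kwargs.emplace(arg.substr(0, pos), arg.substr(pos + 1));
    }
  }
  return kwargs;
}
```

// As with the code above, emplace keeps the first value seen for a repeated
// key, and values cannot contain spaces because tokens are split before '='
// is examined.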
// sentencepiece-0.1.96/src/testharness.cc
// Copyright 2016 Google Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
#include "testharness.h"
#ifndef OS_WIN
#include
#include
#else
#include
#endif
#include
#include
#include
#include "common.h"
#include "third_party/absl/strings/str_cat.h"
#include "util.h"
namespace sentencepiece {
namespace test {
namespace {
struct Test {
const char *base;
const char *name;
void (*func)();
};
std::vector<Test> *tests;
} // namespace
bool RegisterTest(const char *base, const char *name, void (*func)()) {
if (tests == nullptr) {
tests = new std::vector<Test>;
}
Test t;
t.base = base;
t.name = name;
t.func = func;
tests->emplace_back(t);
return true;
}
int RunAllTests() {
int num = 0;
#ifdef OS_WIN
_mkdir(absl::GetFlag(FLAGS_test_tmpdir).c_str());
#else
mkdir(absl::GetFlag(FLAGS_test_tmpdir).c_str(), S_IRUSR | S_IWUSR | S_IXUSR);
#endif
if (tests == nullptr) {
std::cerr << "No tests are found" << std::endl;
return 0;
}
for (const Test &t : *(tests)) {
std::cerr << "[ RUN ] " << t.base << "." << t.name << std::endl;
(*t.func)();
std::cerr << "[ OK ] " << t.base << "." << t.name << std::endl;
++num;
}
std::cerr << "==== PASSED " << num << " tests" << std::endl;
return 0;
}
} // namespace test
} // namespace sentencepiece
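// Illustrative sketch (not part of the library): the register-then-run pattern
// used by RegisterTest and RunAllTests above. The toy:: names are
// hypothetical. The code above creates its vector lazily through a pointer;
// this sketch swaps in a function-local static, which likewise guarantees the
// registry exists before the first registration regardless of static
// initialization order.

```cpp
#include <vector>

namespace toy {

struct Test {
  const char *base;
  const char *name;
  void (*func)();
};

// Constructed on first use, so registrations made during static
// initialization always find a live container.
inline std::vector<Test> &Registry() {
  static std::vector<Test> tests;
  return tests;
}

// Returns bool so a call can initialize a namespace-scope variable, which is
// how a TEST-style macro can register a function before main() runs.
inline bool RegisterTest(const char *base, const char *name, void (*func)()) {
  Registry().push_back({base, name, func});
  return true;
}

// Runs every registered test in registration order and returns the count.
inline int RunAll() {
  int num = 0;
  for (const Test &t : Registry()) {
    t.func();
    ++num;
  }
  return num;
}

}  // namespace toy
```

// A captureless lambda converts to `void (*)()`, so tests can be registered
// inline, for example toy::RegisterTest("Math", "Add", [] { /* checks */ });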
// sentencepiece-0.1.96/src/init_test.cc
// Copyright 2016 Google Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
#include "init.h"
#include "common.h"
#include "testharness.h"
ABSL_FLAG(int32, int32_f, 10, "int32_flags");
ABSL_FLAG(bool, bool_f, false, "bool_flags");
ABSL_FLAG(int64, int64_f, 9223372036854775807LL, "int64_flags");
ABSL_FLAG(uint64, uint64_f, 18446744073709551615ULL, "uint64_flags");
ABSL_FLAG(double, double_f, 40.0, "double_flags");
ABSL_FLAG(std::string, string_f, "str", "string_flags");
ABSL_DECLARE_FLAG(bool, help);
ABSL_DECLARE_FLAG(bool, version);
using sentencepiece::ParseCommandLineFlags;
namespace absl {
TEST(FlagsTest, DefaultValueTest) {
EXPECT_EQ(10, absl::GetFlag(FLAGS_int32_f));
EXPECT_EQ(false, absl::GetFlag(FLAGS_bool_f));
EXPECT_EQ(9223372036854775807LL, absl::GetFlag(FLAGS_int64_f));
EXPECT_EQ(18446744073709551615ULL, absl::GetFlag(FLAGS_uint64_f));
EXPECT_EQ(40.0, absl::GetFlag(FLAGS_double_f));
EXPECT_EQ("str", absl::GetFlag(FLAGS_string_f));
}
TEST(FlagsTest, ParseCommandLineFlagsTest) {
const char *kFlags[] = {"program", "--int32_f=100", "other1",
"--bool_f=true", "--int64_f=200", "--uint64_f=300",
"--double_f=400", "--string_f=foo", "other2",
"other3"};
int argc = arraysize(kFlags);
char **argv = const_cast<char **>(kFlags);
ParseCommandLineFlags(kFlags[0], &argc, &argv);
EXPECT_EQ(100, absl::GetFlag(FLAGS_int32_f));
EXPECT_EQ(true, absl::GetFlag(FLAGS_bool_f));
EXPECT_EQ(200, absl::GetFlag(FLAGS_int64_f));
EXPECT_EQ(300, absl::GetFlag(FLAGS_uint64_f));
EXPECT_EQ(400.0, absl::GetFlag(FLAGS_double_f));
EXPECT_EQ("foo", absl::GetFlag(FLAGS_string_f));
EXPECT_EQ(4, argc);
EXPECT_EQ("program", std::string(argv[0]));
EXPECT_EQ("other1", std::string(argv[1]));
EXPECT_EQ("other2", std::string(argv[2]));
EXPECT_EQ("other3", std::string(argv[3]));
}
TEST(FlagsTest, ParseCommandLineFlagsTest2) {
const char *kFlags[] = {"program", "--int32_f", "500",
"-int64_f=600", "-uint64_f", "700",
"--bool_f=FALSE"};
int argc = arraysize(kFlags);
char **argv = const_cast<char **>(kFlags);
ParseCommandLineFlags(kFlags[0], &argc, &argv);
EXPECT_EQ(500, absl::GetFlag(FLAGS_int32_f));
EXPECT_EQ(600, absl::GetFlag(FLAGS_int64_f));
EXPECT_EQ(700, absl::GetFlag(FLAGS_uint64_f));
EXPECT_FALSE(absl::GetFlag(FLAGS_bool_f));
EXPECT_EQ(1, argc);
}
TEST(FlagsTest, ParseCommandLineFlagsTest3) {
const char *kFlags[] = {"program", "--bool_f", "--int32_f", "800"};
int argc = arraysize(kFlags);
char **argv = const_cast<char **>(kFlags);
ParseCommandLineFlags(kFlags[0], &argc, &argv);
EXPECT_TRUE(absl::GetFlag(FLAGS_bool_f));
EXPECT_EQ(800, absl::GetFlag(FLAGS_int32_f));
EXPECT_EQ(1, argc);
}
#ifndef _USE_EXTERNAL_ABSL
TEST(FlagsTest, ParseCommandLineFlagsHelpTest) {
const char *kFlags[] = {"program", "--help"};
int argc = arraysize(kFlags);
char **argv = const_cast<char **>(kFlags);
EXPECT_DEATH(ParseCommandLineFlags(kFlags[0], &argc, &argv), "");
absl::SetFlag(&FLAGS_help, false);
}
TEST(FlagsTest, ParseCommandLineFlagsVersionTest) {
const char *kFlags[] = {"program", "--version"};
int argc = arraysize(kFlags);
char **argv = const_cast<char **>(kFlags);
EXPECT_DEATH(ParseCommandLineFlags(kFlags[0], &argc, &argv), "");
absl::SetFlag(&FLAGS_version, false);
}
TEST(FlagsTest, ParseCommandLineFlagsUnknownTest) {
const char *kFlags[] = {"program", "--foo"};
int argc = arraysize(kFlags);
char **argv = const_cast<char **>(kFlags);
EXPECT_DEATH(ParseCommandLineFlags(kFlags[0], &argc, &argv), "");
}
TEST(FlagsTest, ParseCommandLineFlagsInvalidBoolTest) {
const char *kFlags[] = {"program", "--bool_f=X"};
int argc = arraysize(kFlags);
char **argv = const_cast<char **>(kFlags);
EXPECT_DEATH(ParseCommandLineFlags(kFlags[0], &argc, &argv), "");
}
TEST(FlagsTest, ParseCommandLineFlagsEmptyStringArgs) {
const char *kFlags[] = {"program", "--string_f="};
int argc = arraysize(kFlags);
char **argv = const_cast<char **>(kFlags);
ParseCommandLineFlags(kFlags[0], &argc, &argv);
EXPECT_EQ(1, argc);
EXPECT_EQ("", absl::GetFlag(FLAGS_string_f));
}
TEST(FlagsTest, ParseCommandLineFlagsEmptyBoolArgs) {
const char *kFlags[] = {"program", "--bool_f"};
int argc = arraysize(kFlags);
char **argv = const_cast<char **>(kFlags);
ParseCommandLineFlags(kFlags[0], &argc, &argv);
EXPECT_EQ(1, argc);
EXPECT_TRUE(absl::GetFlag(FLAGS_bool_f));
}
TEST(FlagsTest, ParseCommandLineFlagsEmptyIntArgs) {
const char *kFlags[] = {"program", "--int32_f"};
int argc = arraysize(kFlags);
char **argv = const_cast<char **>(kFlags);
EXPECT_DEATH(ParseCommandLineFlags(kFlags[0], &argc, &argv), "");
}
#endif // _USE_EXTERNAL_ABSL
} // namespace absl
// sentencepiece-0.1.96/src/builder.cc
// Copyright 2016 Google Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
#include
#include
#include
#include "builder.h"
#include "filesystem.h"
#include "third_party/absl/strings/str_join.h"
#include "third_party/absl/strings/str_replace.h"
#include "third_party/absl/strings/str_split.h"
#include "third_party/absl/strings/strip.h"
#ifdef ENABLE_NFKC_COMPILE
#include
#include
#include
#include
#include
#include
#endif // ENABLE_NFKC_COMPILE
#include
#include "normalization_rule.h"
#include "normalizer.h"
#include "third_party/darts_clone/darts.h"
#include "util.h"
namespace sentencepiece {
namespace normalizer {
namespace {
constexpr int kMaxUnicode = 0x10FFFF;
static constexpr char kDefaultNormalizerName[] = "nfkc";
#ifdef ENABLE_NFKC_COMPILE
// Normalize `input` with ICU's normalizer with `mode`.
Builder::Chars UnicodeNormalize(UNormalizationMode mode,
const Builder::Chars &input) {
const std::string utf8 = string_util::UnicodeTextToUTF8(input);
CHECK(!utf8.empty());
icu::UnicodeString ustr = icu::UnicodeString::fromUTF8(utf8.c_str());
UErrorCode status = U_ZERO_ERROR;
icu::UnicodeString dst;
icu::Normalizer::normalize(ustr, mode, 0, dst, status);
CHECK(U_SUCCESS(status));
std::string normalized;
normalized.reserve(dst.length() * 3);
dst.toUTF8String(normalized);
return string_util::UTF8ToUnicodeText(normalized);
}
Builder::Chars ToNFKD(const Builder::Chars &input) {
return UnicodeNormalize(UNORM_NFKD, input);
}
Builder::Chars ToNFKC(const Builder::Chars &input) {
return UnicodeNormalize(UNORM_NFKC, input);
}
Builder::Chars ToNFC(const Builder::Chars &input) {
return UnicodeNormalize(UNORM_NFC, input);
}
Builder::Chars ToNFD(const Builder::Chars &input) {
return UnicodeNormalize(UNORM_NFD, input);
}
// Given an NFKD-normalized string, returns a set of all strings which are
// normalized into the same `nfkd`. `norm2orig` is the normalized to
// un-normalized character mapping.
std::vector<Builder::Chars> ExpandUnnormalized(
const Builder::Chars &nfkd,
const std::map<char32, std::set<char32>> &norm2orig) {
CHECK(!nfkd.empty());
std::vector<Builder::Chars> results;
for (const auto c : port::FindOrDie(norm2orig, nfkd[0])) {
results.push_back({c});
}
for (size_t i = 1; i < nfkd.size(); ++i) {
const auto &orig = port::FindOrDie(norm2orig, nfkd[i]);
std::vector<Builder::Chars> new_results;
for (const auto &r : results) {
for (const auto c : orig) {
new_results.emplace_back(r);
new_results.back().push_back(c);
}
}
results = std::move(new_results);
}
CHECK_EQ(nfkd.size(), results[0].size());
return results;
}
#endif
// Normalizes `src` with `chars_map` and returns normalized Chars.
// `max_len` specifies the maximum length of the key in `chars_map`.
Builder::Chars Normalize(const Builder::CharsMap &chars_map,
const Builder::Chars &src, int max_len) {
CHECK_GE(max_len, 1);
Builder::Chars normalized;
for (size_t i = 0; i < src.size();) {
Builder::CharsMap::const_iterator it = chars_map.end();
const size_t slice = std::min(i + max_len, src.size());
// starts with the longest prefix.
Builder::Chars key(src.begin() + i, src.begin() + slice);
while (!key.empty()) {
it = chars_map.find(key);
if (it != chars_map.end()) {
break;
}
key.pop_back(); // remove the last character.
}
// Consumes one character when no rule is found.
if (it == chars_map.end()) {
normalized.push_back(src[i]);
++i;
} else {
std::copy(it->second.begin(), it->second.end(),
std::back_inserter(normalized));
i += it->first.size();
}
}
return normalized;
}
} // namespace
// static
util::Status Builder::CompileCharsMap(const CharsMap &chars_map,
std::string *output) {
CHECK_OR_RETURN(output);
CHECK_OR_RETURN(!chars_map.empty());
LOG(INFO) << "Loading CharsMap of size=" << chars_map.size();
// Aggregates the same target strings to save footprint.
std::map<Chars, int> normalized2pos;
for (const auto &p : chars_map) {
normalized2pos[p.second] = 0;
}
std::string normalized;
for (auto &p : normalized2pos) {
p.second = normalized.size(); // stores the pointer (position).
const std::string utf8_out = string_util::UnicodeTextToUTF8(p.first);
CHECK_OR_RETURN(string_util::IsStructurallyValid(utf8_out));
normalized += utf8_out;
normalized += '\0';
}
std::vector<std::pair<std::string, int>> kv; // key-value of Trie.
for (const auto &p : chars_map) {
// The value of Trie stores the pointer to the normalized string.
const std::string utf8_in = string_util::UnicodeTextToUTF8(p.first);
CHECK_OR_RETURN(!utf8_in.empty());
CHECK_OR_RETURN(string_util::IsStructurallyValid(utf8_in));
kv.emplace_back(utf8_in, port::FindOrDie(normalized2pos, p.second));
}
std::sort(kv.begin(), kv.end());
std::vector<const char *> key(kv.size());
std::vector<int> value(kv.size());
for (size_t i = 0; i < kv.size(); ++i) {
key[i] = kv[i].first.c_str();
value[i] = kv[i].second;
}
Darts::DoubleArray trie;
CHECK_EQ_OR_RETURN(0, trie.build(key.size(), const_cast<char **>(&key[0]),
nullptr, &value[0]))
<< "cannot build double-array";
int max_nodes_size = 0;
std::vector<Darts::DoubleArray::result_pair_type> results(
2 * Normalizer::kMaxTrieResultsSize);
for (const char *str : key) {
const int num_nodes = trie.commonPrefixSearch(str, results.data(),
results.size(), strlen(str));
max_nodes_size = std::max(num_nodes, max_nodes_size);
}
CHECK_LT_OR_RETURN(max_nodes_size, Normalizer::kMaxTrieResultsSize)
<< "This charsmap contains too many shared prefixes. "
<< "The number of shared prefixes must be less than "
<< Normalizer::kMaxTrieResultsSize;
absl::string_view trie_blob(static_cast<const char *>(trie.array()),
trie.size() * trie.unit_size());
*output = Normalizer::EncodePrecompiledCharsMap(trie_blob, normalized);
LOG(INFO) << "Generated normalizer blob. size=" << output->size();
return util::OkStatus();
}
// static
util::Status Builder::DecompileCharsMap(absl::string_view blob,
Builder::CharsMap *chars_map) {
CHECK_OR_RETURN(chars_map);
chars_map->clear();
absl::string_view trie_blob, normalized;
std::string buf;
RETURN_IF_ERROR(Normalizer::DecodePrecompiledCharsMap(blob, &trie_blob,
&normalized, &buf));
Darts::DoubleArray trie;
trie.set_array(const_cast<char *>(trie_blob.data()),
trie_blob.size() / trie.unit_size());
std::string key;
std::function<void(size_t, size_t)> traverse;
// Given a Trie node at `node_pos` and the key position at `key_pos`,
// expands the child nodes of `node_pos`.
// When leaf nodes are found, stores them into `chars_map`.
traverse = [&traverse, &key, &trie, &normalized, &chars_map](
size_t node_pos, size_t key_pos) -> void {
for (int c = 0; c <= 255; ++c) {
key.push_back(static_cast<char>(c));
size_t copied_node_pos = node_pos;
size_t copied_key_pos = key_pos;
// Note: `copied_(node|key)_pos` are non-const references.
// They store the new positions after node traversal.
const Darts::DoubleArray::result_type result = trie.traverse(
key.data(), copied_node_pos, copied_key_pos, key.size());
if (result >= -1) { // node exists.
if (result >= 0) { // has a value after transition.
const absl::string_view value = normalized.data() + result;
Chars key_chars, value_chars;
for (const auto c : string_util::UTF8ToUnicodeText(key))
key_chars.push_back(c);
for (const auto c : string_util::UTF8ToUnicodeText(value))
value_chars.push_back(c);
(*chars_map)[key_chars] = value_chars;
}
// Recursively traverse.
traverse(copied_node_pos, copied_key_pos);
}
key.pop_back();
}
};
traverse(0, 0);
return util::OkStatus();
}
// static
util::Status Builder::GetPrecompiledCharsMap(const std::string &name,
std::string *output) {
CHECK_OR_RETURN(output);
if (name == "identity") {
output->clear();
return util::OkStatus();
}
std::string result;
for (size_t i = 0; i < kNormalizationRules_size; ++i) {
const auto *blob = &kNormalizationRules_blob[i];
if (blob->name == name) {
output->assign(blob->data, blob->size);
return util::OkStatus();
}
}
return util::StatusBuilder(util::StatusCode::kNotFound, GTL_LOC)
<< "No precompiled charsmap is found: " << name;
}
// static
util::Status Builder::BuildNFKCMap(CharsMap *chars_map) {
#ifdef ENABLE_NFKC_COMPILE
LOG(INFO) << "Running BuildNFKCMap";
// Set of fully NFKD decomposed characters.
std::set<Builder::Chars> nfkd_decomposed;
// Fully normalized one character to unnormalized one character map.
std::map<char32, std::set<char32>> norm2orig;
Builder::CharsMap nfkc_map; // The final NFKC mapping.
constexpr int kMaxUnicode = 0x10FFFF;
for (char32 cp = 1; cp <= kMaxUnicode; ++cp) {
if (!U_IS_UNICODE_CHAR(cp)) {
continue;
}
// Aggregates single character to fully NFKC normalized characters.
const auto nfkc = ToNFKC({cp});
if (nfkc.size() >= 2 || (nfkc.size() == 1 && nfkc[0] != cp)) {
nfkc_map[{cp}] = nfkc;
}
const auto nfkd = ToNFKD({cp});
if (nfkd.size() == 1) {
// Aggregates reverse mapping from normalized to unnormalized character.
norm2orig[nfkd[0]].insert(cp);
} else {
// One character is decomposed into multiple characters.
nfkd_decomposed.insert(nfkd);
}
}
for (const auto &nfkd : nfkd_decomposed) {
const auto nfkc = ToNFC(nfkd);
// This case is already covered by single-character to NFKC mapping.
if (nfkc == nfkd) {
continue;
}
// Expand all possible sequences which are normalized into the same
// `nfkd`.
for (const auto &nfkd_orig : ExpandUnnormalized(nfkd, norm2orig)) {
if (nfkd_orig != nfkc) {
nfkc_map[nfkd_orig] = nfkc;
}
}
}
RETURN_IF_ERROR(RemoveRedundantMap(&nfkc_map));
*chars_map = std::move(nfkc_map);
#else
LOG(ERROR) << "NFKC compile is not enabled."
<< " rebuild with ./configure --enable-nfkc-compile";
#endif
return util::OkStatus();
}
util::Status Builder::BuildNmtNFKCMap(CharsMap *chars_map) {
#ifdef ENABLE_NFKC_COMPILE
LOG(INFO) << "Running BuildNmtNFKCMap";
CharsMap nfkc_map;
RETURN_IF_ERROR(Builder::BuildNFKCMap(&nfkc_map));
// Other code points considered as whitespace.
nfkc_map[{0x0009}] = {0x20}; // TAB
nfkc_map[{0x000A}] = {0x20}; // LINE FEED
nfkc_map[{0x000C}] = {0x20}; // FORM FEED
nfkc_map[{0x000D}] = {0x20}; // CARRIAGE RETURN
nfkc_map[{0x1680}] = {0x20}; // OGHAM SPACE MARK
nfkc_map[{0x200B}] = {0x20}; // ZERO WIDTH SPACE
nfkc_map[{0x200E}] = {0x20}; // LEFT-TO-RIGHT MARK
nfkc_map[{0x200F}] = {0x20}; // RIGHT-TO-LEFT MARK
nfkc_map[{0x2028}] = {0x20}; // LINE SEPARATOR
nfkc_map[{0x2029}] = {0x20}; // PARAGRAPH SEPARATOR
nfkc_map[{0x2581}] = {0x20}; // LOWER ONE EIGHTH BLOCK
nfkc_map[{0xFEFF}] = {0x20}; // ZERO WIDTH NO-BREAK SPACE
nfkc_map[{0xFFFD}] = {0x20}; // REPLACEMENT CHARACTER
nfkc_map[{0x200C}] = {0x20}; // ZERO WIDTH NON-JOINER
// nfkc_map[{0x200D}] = {0x20}; // ZERO WIDTH JOINER
// Ascii Control characters
nfkc_map[{0x0001}] = {};
nfkc_map[{0x0002}] = {};
nfkc_map[{0x0003}] = {};
nfkc_map[{0x0004}] = {};
nfkc_map[{0x0005}] = {};
nfkc_map[{0x0006}] = {};
nfkc_map[{0x0007}] = {};
nfkc_map[{0x0008}] = {};
nfkc_map[{0x000B}] = {};
nfkc_map[{0x000E}] = {};
nfkc_map[{0x000F}] = {};
nfkc_map[{0x0010}] = {};
nfkc_map[{0x0011}] = {};
nfkc_map[{0x0012}] = {};
nfkc_map[{0x0013}] = {};
nfkc_map[{0x0014}] = {};
nfkc_map[{0x0015}] = {};
nfkc_map[{0x0016}] = {};
nfkc_map[{0x0017}] = {};
nfkc_map[{0x0018}] = {};
nfkc_map[{0x0019}] = {};
nfkc_map[{0x001A}] = {};
nfkc_map[{0x001B}] = {};
nfkc_map[{0x001C}] = {};
nfkc_map[{0x001D}] = {};
nfkc_map[{0x001E}] = {};
nfkc_map[{0x001F}] = {};
// ..
nfkc_map[{0x007F}] = {};
nfkc_map[{0x008F}] = {};
nfkc_map[{0x009F}] = {};
// Do not normalize FULL_WIDTH TILDE, since FULL_WIDTH TILDE
// and HALF_WIDTH TILDE are used differently in Japanese.
nfkc_map.erase({0xFF5E});
RETURN_IF_ERROR(RemoveRedundantMap(&nfkc_map));
*chars_map = std::move(nfkc_map);
#else
LOG(ERROR) << "NFKC compile is not enabled."
<< " rebuild with ./configure --enable-nfkc-compile";
#endif
return util::OkStatus();
}
// static
util::Status Builder::MergeUnicodeCaseFoldMap(Builder::CharsMap *chars_map) {
#ifdef ENABLE_NFKC_COMPILE
// Case-folds the target side of every existing rule.
for (auto &entry : *chars_map) {
std::vector<char32> trg;
for (const char32 ch : entry.second) {
trg.push_back(u_foldCase(ch, U_FOLD_CASE_DEFAULT));
}
entry.second = std::move(trg);
}
constexpr int kMaxUnicode = 0x10FFFF;
for (char32 cp = 1; cp <= kMaxUnicode; ++cp) {
if (!U_IS_UNICODE_CHAR(cp)) {
continue;
}
if (chars_map->find({cp}) != chars_map->end()) continue;
const char32 trg = u_foldCase(cp, U_FOLD_CASE_DEFAULT);
if (trg != cp) (*chars_map)[{cp}] = {trg};
}
RETURN_IF_ERROR(RemoveRedundantMap(chars_map));
#endif
return util::OkStatus();
}
// static
util::Status Builder::BuildNFKC_CFMap(CharsMap *chars_map) {
#ifdef ENABLE_NFKC_COMPILE
CharsMap nfkc_map;
RETURN_IF_ERROR(Builder::BuildNFKCMap(&nfkc_map));
RETURN_IF_ERROR(Builder::MergeUnicodeCaseFoldMap(&nfkc_map));
*chars_map = std::move(nfkc_map);
#else
LOG(ERROR) << "NFKC_CF compile is not enabled."
<< " rebuild with ./configure --enable-nfkc-compile";
#endif
return util::OkStatus();
}
// static
util::Status Builder::BuildNmtNFKC_CFMap(CharsMap *chars_map) {
#ifdef ENABLE_NFKC_COMPILE
CharsMap nfkc_map;
RETURN_IF_ERROR(Builder::BuildNmtNFKCMap(&nfkc_map));
RETURN_IF_ERROR(Builder::MergeUnicodeCaseFoldMap(&nfkc_map));
*chars_map = std::move(nfkc_map);
#else
LOG(ERROR) << "NMT_NFKC_CF compile is not enabled."
<< " rebuild with ./configure --enable-nfkc-compile";
#endif
return util::OkStatus();
}
// static
util::Status Builder::LoadCharsMap(absl::string_view filename,
CharsMap *chars_map) {
LOG(INFO) << "Loading mapping file: " << filename.data();
CHECK_OR_RETURN(chars_map);
auto input = filesystem::NewReadableFile(filename);
RETURN_IF_ERROR(input->status());
std::string line;
chars_map->clear();
while (input->ReadLine(&line)) {
std::vector<std::string> fields =
absl::StrSplit(line, '\t', absl::AllowEmpty());
CHECK_GE(fields.size(), 1);
if (fields.size() == 1) fields.push_back(""); // Deletion rule.
std::vector<char32> src, trg;
for (auto s : absl::StrSplit(fields[0], ' ')) {
if (s.empty()) continue;
absl::ConsumePrefix(&s, "U+");
src.push_back(string_util::HexToInt(s));
}
for (auto s : absl::StrSplit(fields[1], ' ')) {
if (s.empty()) continue;
absl::ConsumePrefix(&s, "U+");
trg.push_back(string_util::HexToInt(s));
}
CHECK_OR_RETURN(!src.empty());
(*chars_map)[src] = trg;
}
return util::OkStatus();
}
// static
util::Status Builder::SaveCharsMap(absl::string_view filename,
const Builder::CharsMap &chars_map) {
auto output = filesystem::NewWritableFile(filename);
RETURN_IF_ERROR(output->status());
for (const auto &c : chars_map) {
std::vector<std::string> src, trg;
string_util::UnicodeText srcu, trgu;
for (char32 v : c.first) {
src.push_back(string_util::IntToHex(v));
srcu.push_back(v);
}
for (char32 v : c.second) {
trg.push_back(string_util::IntToHex(v));
trgu.push_back(v);
}
std::string line = absl::StrJoin(src, " ") + "\t" +
absl::StrJoin(trg, " ") + "\t# " +
string_util::UnicodeTextToUTF8(c.first) + " => " +
string_util::UnicodeTextToUTF8(c.second);
line = absl::StrReplaceAll(
line,
{{"\b", " "}, {"\v", " "}, {"\f", " "}, {"\n", " "}, {"\r", " "}});
output->WriteLine(line);
}
return util::OkStatus();
}
// static
util::Status Builder::RemoveRedundantMap(CharsMap *chars_map) {
CHECK_OR_RETURN(chars_map);
CharsMap new_chars_map;
size_t max_len = 0;
for (const auto &p : *chars_map) {
max_len = std::max(p.first.size(), max_len);
if (p.first.size() == 1) {
new_chars_map.insert(p);
}
}
CHECK_GT_OR_RETURN(max_len, 0);
// Checks whether the rules with size of `len` can be normalized by
// the rules with size of [1 .. len - 1].
for (size_t len = 2; len <= max_len; ++len) {
for (const auto &p : *chars_map) {
if (p.first.size() == len &&
p.second != Normalize(new_chars_map, p.first, len - 1)) {
new_chars_map.insert(p);
}
}
}
// Verify all characters in `chars_map` are normalized by `new_chars_map`.
for (const auto &p : *chars_map) {
CHECK_EQ_OR_RETURN(p.second, Normalize(new_chars_map, p.first, max_len));
}
*chars_map = std::move(new_chars_map);
return util::OkStatus();
}
} // namespace normalizer
} // namespace sentencepiece
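The `Normalize` helper above rewrites its input greedily: at each position it tries the longest candidate key first, shortens the key one character at a time, and copies a single character through when no rule matches. The following is a minimal standalone sketch of that loop, assuming plain `std::map` and `char32_t` in place of `Builder::CharsMap` and `char32`. The names `Chars`, `CharsMap`, and `NormalizeSketch` are illustrative and are not part of SentencePiece.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <map>
#include <vector>

// Illustrative stand-ins for Builder::Chars and Builder::CharsMap.
using Chars = std::vector<char32_t>;
using CharsMap = std::map<Chars, Chars>;

// Rewrites `src` by applying, at each position, the longest key found in
// `rules`. `max_len` bounds the key length tried at each position.
Chars NormalizeSketch(const CharsMap &rules, const Chars &src,
                      std::size_t max_len) {
  assert(max_len >= 1);
  Chars out;
  for (std::size_t i = 0; i < src.size();) {
    const std::size_t end = std::min(i + max_len, src.size());
    // Start with the longest candidate and shrink until a rule matches.
    Chars key(src.begin() + i, src.begin() + end);
    auto it = rules.end();
    while (!key.empty()) {
      it = rules.find(key);
      if (it != rules.end()) break;
      key.pop_back();
    }
    if (it == rules.end()) {
      // No rule applies: copy one character through unchanged.
      out.push_back(src[i]);
      ++i;
    } else {
      // A rule applies: emit its target (possibly empty) and skip the key.
      out.insert(out.end(), it->second.begin(), it->second.end());
      i += it->first.size();
    }
  }
  return out;
}
```

With the rules `ab -> X`, `a -> Y`, and `c -> (empty)`, the input `abac` becomes `XY`: the two-character rule wins at position 0, the one-character rule applies at position 2, and `c` is deleted. `RemoveRedundantMap` relies on this behaviour to drop any longer rule whose result the shorter rules already produce.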
// sentencepiece-0.1.96/src/char_model_trainer.h
// Copyright 2016 Google Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
#ifndef CHAR_MODEL_TRAINER_H_
#define CHAR_MODEL_TRAINER_H_
#include "sentencepiece_model.pb.h"
#include "trainer_interface.h"
namespace sentencepiece {
namespace character {
// Trainer class for character model.
class Trainer : public TrainerInterface {
public:
Trainer(const TrainerSpec &trainer_spec,
const NormalizerSpec &normalizer_spec,
const NormalizerSpec &denormalizer_spec)
: TrainerInterface::TrainerInterface(trainer_spec, normalizer_spec,
denormalizer_spec) {}
util::Status Train() override;
};
} // namespace character
} // namespace sentencepiece
#endif // CHAR_MODEL_TRAINER_H_
// sentencepiece-0.1.96/src/filesystem.cc
// Copyright 2016 Google Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
#include <iostream>
#include "filesystem.h"
#include "third_party/absl/memory/memory.h"
#include "util.h"
#if defined(OS_WIN) && defined(UNICODE) && defined(_UNICODE)
#define WPATH(path) (::sentencepiece::win32::Utf8ToWide(path).c_str())
#else
#define WPATH(path) (path)
#endif
namespace sentencepiece {
namespace filesystem {
class PosixReadableFile : public ReadableFile {
public:
PosixReadableFile(absl::string_view filename, bool is_binary = false)
: is_(filename.empty()
? &std::cin
: new std::ifstream(WPATH(filename.data()),
is_binary ? std::ios::binary | std::ios::in
: std::ios::in)) {
if (!*is_)
status_ = util::StatusBuilder(util::StatusCode::kNotFound, GTL_LOC)
<< "\"" << filename.data() << "\": " << util::StrError(errno);
}
~PosixReadableFile() {
if (is_ != &std::cin) delete is_;
}
util::Status status() const { return status_; }
bool ReadLine(std::string *line) {
return static_cast<bool>(std::getline(*is_, *line));
}
bool ReadAll(std::string *line) {
if (is_ == &std::cin) {
LOG(ERROR) << "ReadAll is not supported for stdin.";
return false;
}
line->assign(std::istreambuf_iterator<char>(*is_),
std::istreambuf_iterator<char>());
return true;
}
private:
util::Status status_;
std::istream *is_;
};
class PosixWritableFile : public WritableFile {
public:
PosixWritableFile(absl::string_view filename, bool is_binary = false)
: os_(filename.empty()
? &std::cout
: new std::ofstream(WPATH(filename.data()),
is_binary ? std::ios::binary | std::ios::out
: std::ios::out)) {
if (!*os_)
status_ =
util::StatusBuilder(util::StatusCode::kPermissionDenied, GTL_LOC)
<< "\"" << filename.data() << "\": " << util::StrError(errno);
}
~PosixWritableFile() {
if (os_ != &std::cout) delete os_;
}
util::Status status() const { return status_; }
bool Write(absl::string_view text) {
os_->write(text.data(), text.size());
return os_->good();
}
bool WriteLine(absl::string_view text) { return Write(text) && Write("\n"); }
private:
util::Status status_;
std::ostream *os_;
};
using DefaultReadableFile = PosixReadableFile;
using DefaultWritableFile = PosixWritableFile;
std::unique_ptr<ReadableFile> NewReadableFile(absl::string_view filename,
bool is_binary) {
return absl::make_unique<DefaultReadableFile>(filename, is_binary);
}
std::unique_ptr<WritableFile> NewWritableFile(absl::string_view filename,
bool is_binary) {
return absl::make_unique<DefaultWritableFile>(filename, is_binary);
}
} // namespace filesystem
} // namespace sentencepiece
// sentencepiece-0.1.96/src/bpe_model.h
// Copyright 2016 Google Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
#ifndef BPE_MODEL_H_
#define BPE_MODEL_H_
#include "model_interface.h"
#include "sentencepiece_model.pb.h"
namespace sentencepiece {
namespace bpe {
// Segmentation model with BPE (Byte Pair Encoding)
// Details:
// Neural Machine Translation of Rare Words with Subword Units
// https://arxiv.org/abs/1508.07909
//
// https://en.wikipedia.org/wiki/Byte_pair_encoding
class Model : public ModelInterface {
public:
explicit Model(const ModelProto &model_proto);
~Model() override;
EncodeResult Encode(absl::string_view normalized) const override {
return SampleEncode(normalized, 0.0);
}
// Sampling with BPE-dropout: https://arxiv.org/pdf/1910.13267.pdf
// `alpha` is dropout probability in BPE-dropout paper.
// Skips merge operation with `alpha` probability.
// When alpha <= 0.0, no sampling is performed.
EncodeResult SampleEncode(absl::string_view normalized,
float alpha) const override;
bool IsSampleEncodeAvailable() const override { return true; }
bool IsNBestEncodeAvailable() const override { return false; }
};
} // namespace bpe
} // namespace sentencepiece
#endif // BPE_MODEL_H_
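The comments on `SampleEncode` above describe BPE-dropout as skipping each merge operation with probability `alpha`, with no sampling when `alpha <= 0.0`. The toy encoder below sketches that idea only. It is not SentencePiece's implementation: the merge table `MergeRanks`, the function `ToyBpeDropoutEncode`, and the ASCII-only character split are all assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <map>
#include <random>
#include <string>
#include <utility>
#include <vector>

// Hypothetical merge table: adjacent pair -> rank (lower rank merges first).
using MergeRanks = std::map<std::pair<std::string, std::string>, int>;

// Splits `word` into single characters (ASCII only in this sketch) and
// repeatedly merges the best-ranked adjacent pair. Each candidate merge is
// ignored with probability `alpha`; with alpha <= 0 the result is the
// ordinary deterministic BPE segmentation for this merge table.
std::vector<std::string> ToyBpeDropoutEncode(const std::string &word,
                                             const MergeRanks &ranks,
                                             float alpha, std::mt19937 *rng) {
  assert(rng != nullptr);
  std::vector<std::string> symbols;
  for (const char ch : word) symbols.push_back(std::string(1, ch));
  std::uniform_real_distribution<float> uniform(0.0f, 1.0f);
  while (symbols.size() > 1) {
    int best_rank = std::numeric_limits<int>::max();
    std::size_t best_pos = symbols.size();  // sentinel: nothing selected
    for (std::size_t i = 0; i + 1 < symbols.size(); ++i) {
      const auto it = ranks.find(std::make_pair(symbols[i], symbols[i + 1]));
      if (it == ranks.end()) continue;
      // Dropout step: skip this candidate with probability `alpha`.
      if (alpha > 0.0f && uniform(*rng) < alpha) continue;
      if (it->second < best_rank) {
        best_rank = it->second;
        best_pos = i;
      }
    }
    if (best_pos == symbols.size()) break;  // no merge survived this round
    symbols[best_pos] += symbols[best_pos + 1];
    symbols.erase(symbols.begin() + static_cast<std::ptrdiff_t>(best_pos) + 1);
  }
  return symbols;
}
```

Whatever value `alpha` takes, concatenating the returned pieces reproduces the input word; dropout changes only where the piece boundaries fall. In this sketch, `Encode` corresponds to calling the function with `alpha = 0.0`, mirroring how `Model::Encode` forwards to `SampleEncode(normalized, 0.0)`.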
// sentencepiece-0.1.96/src/model_interface_test.cc
// Copyright 2016 Google Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
#include "model_factory.h"
#include "model_interface.h"
#include "testharness.h"
#include "third_party/absl/container/flat_hash_map.h"
#include "util.h"
namespace sentencepiece {
namespace {
#define WS "\xe2\x96\x81"
const std::vector<TrainerSpec::ModelType> kModelTypes = {
TrainerSpec::UNIGRAM, TrainerSpec::BPE, TrainerSpec::WORD,
TrainerSpec::CHAR};
ModelProto MakeBaseModelProto(TrainerSpec::ModelType type,
bool byte_fallback = false) {
ModelProto model_proto;
auto *sp1 = model_proto.add_pieces();
auto *sp2 = model_proto.add_pieces();
auto *sp3 = model_proto.add_pieces();
model_proto.mutable_trainer_spec()->set_model_type(type);
model_proto.mutable_trainer_spec()->set_byte_fallback(byte_fallback);
sp1->set_type(ModelProto::SentencePiece::UNKNOWN);
sp1->set_piece("<unk>");
sp2->set_type(ModelProto::SentencePiece::CONTROL);
sp2->set_piece("<s>");
sp3->set_type(ModelProto::SentencePiece::CONTROL);
sp3->set_piece("</s>");
return model_proto;
}
void AddPiece(ModelProto *model_proto, const std::string &piece,
float score = 0.0) {
auto *sp = model_proto->add_pieces();
sp->set_piece(piece);
sp->set_score(score);
}
void AddBytePiece(ModelProto *model_proto, unsigned char byte) {
auto *sp = model_proto->add_pieces();
sp->set_piece(ByteToPiece(byte));
sp->set_type(ModelProto::SentencePiece::BYTE);
}
TEST(ModelInterfaceTest, GetDefaultPieceTest) {
{
ModelProto model_proto;
EXPECT_EQ("<unk>", model_proto.trainer_spec().unk_piece());
EXPECT_EQ("<s>", model_proto.trainer_spec().bos_piece());
EXPECT_EQ("</s>", model_proto.trainer_spec().eos_piece());
EXPECT_EQ("<pad>", model_proto.trainer_spec().pad_piece());
}
{
ModelProto model_proto = MakeBaseModelProto(TrainerSpec::UNIGRAM);
AddPiece(&model_proto, "a");
auto model = ModelFactory::Create(model_proto);
EXPECT_EQ("<unk>", model->unk_piece());
EXPECT_EQ("<s>", model->bos_piece());
EXPECT_EQ("</s>", model->eos_piece());
EXPECT_EQ("<pad>", model->pad_piece());
}
{
ModelProto model_proto = MakeBaseModelProto(TrainerSpec::UNIGRAM);
AddPiece(&model_proto, "a");
model_proto.mutable_trainer_spec()->clear_unk_piece();
model_proto.mutable_trainer_spec()->clear_bos_piece();
model_proto.mutable_trainer_spec()->clear_eos_piece();
model_proto.mutable_trainer_spec()->clear_pad_piece();
auto model = ModelFactory::Create(model_proto);
EXPECT_EQ("<unk>", model->unk_piece());
EXPECT_EQ("<s>", model->bos_piece());
EXPECT_EQ("</s>", model->eos_piece());
EXPECT_EQ("<pad>", model->pad_piece());
}
{
ModelProto model_proto = MakeBaseModelProto(TrainerSpec::UNIGRAM);
AddPiece(&model_proto, "a");
model_proto.mutable_trainer_spec()->set_unk_piece("UNK");
model_proto.mutable_trainer_spec()->set_bos_piece("BOS");
model_proto.mutable_trainer_spec()->set_eos_piece("EOS");
model_proto.mutable_trainer_spec()->set_pad_piece("PAD");
auto model = ModelFactory::Create(model_proto);
EXPECT_EQ("UNK", model->unk_piece());
EXPECT_EQ("BOS", model->bos_piece());
EXPECT_EQ("EOS", model->eos_piece());
EXPECT_EQ("PAD", model->pad_piece());
}
}
TEST(ModelInterfaceTest, SetModelInterfaceTest) {
for (const auto type : kModelTypes) {
ModelProto model_proto = MakeBaseModelProto(type);
AddPiece(&model_proto, "a");
AddPiece(&model_proto, "b");
AddPiece(&model_proto, "c");
AddPiece(&model_proto, "d");
auto model = ModelFactory::Create(model_proto);
EXPECT_EQ(model_proto.SerializeAsString(),
model->model_proto().SerializeAsString());
}
}
TEST(ModelInterfaceTest, PieceToIdTest) {
for (const auto type : kModelTypes) {
ModelProto model_proto = MakeBaseModelProto(type);
AddPiece(&model_proto, "a", 0.1); // 3
AddPiece(&model_proto, "b", 0.2); // 4
AddPiece(&model_proto, "c", 0.3); // 5
AddPiece(&model_proto, "d", 0.4); // 6
AddPiece(&model_proto, "e", 0.5); // 7
model_proto.mutable_pieces(6)->set_type(ModelProto::SentencePiece::UNUSED);
model_proto.mutable_pieces(7)->set_type(
ModelProto::SentencePiece::USER_DEFINED);
auto model = ModelFactory::Create(model_proto);
EXPECT_EQ(model_proto.SerializeAsString(),
model->model_proto().SerializeAsString());
EXPECT_EQ(0, model->PieceToId("<unk>"));
EXPECT_EQ(1, model->PieceToId("<s>"));
EXPECT_EQ(2, model->PieceToId("</s>"));
EXPECT_EQ(3, model->PieceToId("a"));
EXPECT_EQ(4, model->PieceToId("b"));
EXPECT_EQ(5, model->PieceToId("c"));
EXPECT_EQ(6, model->PieceToId("d"));
EXPECT_EQ(7, model->PieceToId("e"));
EXPECT_EQ(0, model->PieceToId("f")); // unk
EXPECT_EQ(0, model->PieceToId("")); // unk
EXPECT_EQ("<unk>", model->IdToPiece(0));
EXPECT_EQ("<s>", model->IdToPiece(1));
EXPECT_EQ("</s>", model->IdToPiece(2));
EXPECT_EQ("a", model->IdToPiece(3));
EXPECT_EQ("b", model->IdToPiece(4));
EXPECT_EQ("c", model->IdToPiece(5));
EXPECT_EQ("d", model->IdToPiece(6));
EXPECT_EQ("e", model->IdToPiece(7));
EXPECT_TRUE(model->IsUnknown(0));
EXPECT_FALSE(model->IsUnknown(1));
EXPECT_FALSE(model->IsUnknown(2));
EXPECT_FALSE(model->IsUnknown(3));
EXPECT_FALSE(model->IsUnknown(4));
EXPECT_FALSE(model->IsUnknown(5));
EXPECT_FALSE(model->IsUnknown(6));
EXPECT_FALSE(model->IsUnknown(7));
EXPECT_FALSE(model->IsControl(0));
EXPECT_TRUE(model->IsControl(1));
EXPECT_TRUE(model->IsControl(2));
EXPECT_FALSE(model->IsControl(3));
EXPECT_FALSE(model->IsControl(4));
EXPECT_FALSE(model->IsControl(5));
EXPECT_FALSE(model->IsControl(6));
EXPECT_FALSE(model->IsControl(7));
EXPECT_FALSE(model->IsUnused(0));
EXPECT_FALSE(model->IsUnused(1));
EXPECT_FALSE(model->IsUnused(2));
EXPECT_FALSE(model->IsUnused(3));
EXPECT_FALSE(model->IsUnused(4));
EXPECT_FALSE(model->IsUnused(5));
EXPECT_TRUE(model->IsUnused(6));
EXPECT_FALSE(model->IsUnused(7));
EXPECT_FALSE(model->IsUserDefined(0));
EXPECT_FALSE(model->IsUserDefined(1));
EXPECT_FALSE(model->IsUserDefined(2));
EXPECT_FALSE(model->IsUserDefined(3));
EXPECT_FALSE(model->IsUserDefined(4));
EXPECT_FALSE(model->IsUserDefined(5));
EXPECT_FALSE(model->IsUserDefined(6));
EXPECT_TRUE(model->IsUserDefined(7));
EXPECT_NEAR(0, model->GetScore(0), 0.0001);
EXPECT_NEAR(0, model->GetScore(1), 0.0001);
EXPECT_NEAR(0, model->GetScore(2), 0.0001);
EXPECT_NEAR(0.1, model->GetScore(3), 0.0001);
EXPECT_NEAR(0.2, model->GetScore(4), 0.0001);
EXPECT_NEAR(0.3, model->GetScore(5), 0.0001);
EXPECT_NEAR(0.4, model->GetScore(6), 0.0001);
EXPECT_NEAR(0.5, model->GetScore(7), 0.0001);
}
}
TEST(ModelInterfaceTest, InvalidModelTest) {
// Empty piece.
{
ModelProto model_proto = MakeBaseModelProto(TrainerSpec::UNIGRAM);
AddPiece(&model_proto, "");
auto model = ModelFactory::Create(model_proto);
EXPECT_FALSE(model->status().ok());
}
// Duplicated pieces.
{
ModelProto model_proto = MakeBaseModelProto(TrainerSpec::UNIGRAM);
AddPiece(&model_proto, "a");
AddPiece(&model_proto, "a");
auto model = ModelFactory::Create(model_proto);
EXPECT_FALSE(model->status().ok());
}
// Multiple unknowns.
{
ModelProto model_proto = MakeBaseModelProto(TrainerSpec::UNIGRAM);
model_proto.mutable_pieces(1)->set_type(ModelProto::SentencePiece::UNKNOWN);
auto model = ModelFactory::Create(model_proto);
EXPECT_FALSE(model->status().ok());
}
// No unknown.
{
ModelProto model_proto = MakeBaseModelProto(TrainerSpec::UNIGRAM);
model_proto.mutable_pieces(0)->set_type(ModelProto::SentencePiece::CONTROL);
auto model = ModelFactory::Create(model_proto);
EXPECT_FALSE(model->status().ok());
}
}
TEST(ModelInterfaceTest, ByteFallbackModelTest) {
{
ModelProto model_proto = MakeBaseModelProto(TrainerSpec::UNIGRAM, true);
for (int i = 0; i < 256; ++i) {
AddBytePiece(&model_proto, i);
}
AddPiece(&model_proto, "a");
auto model = ModelFactory::Create(model_proto);
EXPECT_TRUE(model->status().ok());
}
// `byte_fallback` is true, but there are not 256 byte pieces.
{
ModelProto model_proto = MakeBaseModelProto(TrainerSpec::UNIGRAM, true);
for (int i = 0; i < 10; ++i) {
AddBytePiece(&model_proto, i);
}
AddPiece(&model_proto, "a");
auto model = ModelFactory::Create(model_proto);
EXPECT_FALSE(model->status().ok());
}
// `byte_fallback` is false, but a byte piece is found.
{
ModelProto model_proto = MakeBaseModelProto(TrainerSpec::UNIGRAM);
for (int i = 0; i < 10; ++i) {
AddBytePiece(&model_proto, i);
}
AddPiece(&model_proto, "a");
auto model = ModelFactory::Create(model_proto);
EXPECT_FALSE(model->status().ok());
}
}
std::string RandomString(int length) {
const char kAlphaNum[] =
"0123456789"
"!@#$%^&*"
"ABCDEFGHIJKLMNOPQRSTUVWXYZ"
"abcdefghijklmnopqrstuvwxyz";
const int kAlphaSize = sizeof(kAlphaNum) - 1;
const int size = rand() % length + 1;
std::string result;
for (int i = 0; i < size; ++i) {
result += kAlphaNum[rand() % kAlphaSize];
}
return result;
}
TEST(ModelInterfaceTest, PieceToIdStressTest) {
for (const auto type : kModelTypes) {
for (int i = 0; i < 100; ++i) {
absl::flat_hash_map<std::string, int> expected_p2i;
absl::flat_hash_map<int, std::string> expected_i2p;
ModelProto model_proto = MakeBaseModelProto(type);
for (int n = 0; n < 1000; ++n) {
const std::string piece = RandomString(10);
if (expected_p2i.find(piece) != expected_p2i.end()) {
continue;
}
expected_p2i[piece] = model_proto.pieces_size();
expected_i2p[model_proto.pieces_size()] = piece;
AddPiece(&model_proto, piece);
}
auto model = ModelFactory::Create(model_proto);
for (const auto &it : expected_p2i) {
EXPECT_EQ(it.second, model->PieceToId(it.first));
}
for (const auto &it : expected_i2p) {
EXPECT_EQ(it.second, model->IdToPiece(it.first));
}
}
}
}
TEST(ModelInterfaceTest, SplitIntoWordsTest) {
{
const auto v = SplitIntoWords(WS "this" WS "is" WS "a" WS "pen");
EXPECT_EQ(4, v.size());
EXPECT_EQ(WS "this", v[0]);
EXPECT_EQ(WS "is", v[1]);
EXPECT_EQ(WS "a", v[2]);
EXPECT_EQ(WS "pen", v[3]);
}
{
const auto v = SplitIntoWords("this" WS "is" WS "a" WS "pen");
EXPECT_EQ(4, v.size());
EXPECT_EQ("this", v[0]);
EXPECT_EQ(WS "is", v[1]);
EXPECT_EQ(WS "a", v[2]);
EXPECT_EQ(WS "pen", v[3]);
}
{
const auto v = SplitIntoWords(WS "this" WS WS "is");
EXPECT_EQ(3, v.size());
EXPECT_EQ(WS "this", v[0]);
EXPECT_EQ(WS, v[1]);
EXPECT_EQ(WS "is", v[2]);
}
{
const auto v = SplitIntoWords("");
EXPECT_TRUE(v.empty());
}
{
const auto v = SplitIntoWords("hello");
EXPECT_EQ(1, v.size());
EXPECT_EQ("hello", v[0]);
}
}
TEST(ModelInterfaceTest, SplitIntoWordsSuffixTest) {
{
const auto v = SplitIntoWords("this" WS "is" WS "a" WS "pen" WS, true);
EXPECT_EQ(4, v.size());
EXPECT_EQ("this" WS, v[0]);
EXPECT_EQ("is" WS, v[1]);
EXPECT_EQ("a" WS, v[2]);
EXPECT_EQ("pen" WS, v[3]);
}
{
const auto v = SplitIntoWords("this" WS "is" WS "a" WS "pen", true);
EXPECT_EQ(4, v.size());
EXPECT_EQ("this" WS, v[0]);
EXPECT_EQ("is" WS, v[1]);
EXPECT_EQ("a" WS, v[2]);
EXPECT_EQ("pen", v[3]);
}
{
const auto v = SplitIntoWords(WS "this" WS WS "is", true);
EXPECT_EQ(4, v.size());
EXPECT_EQ(WS, v[0]);
EXPECT_EQ("this" WS, v[1]);
EXPECT_EQ(WS, v[2]);
EXPECT_EQ("is", v[3]);
}
{
const auto v = SplitIntoWords("", true);
EXPECT_TRUE(v.empty());
}
{
const auto v = SplitIntoWords("hello", true);
EXPECT_EQ(1, v.size());
EXPECT_EQ("hello", v[0]);
}
{
const auto v = SplitIntoWords("hello" WS WS, true);
EXPECT_EQ(2, v.size());
EXPECT_EQ("hello" WS, v[0]);
EXPECT_EQ(WS, v[1]);
}
{
const auto v = SplitIntoWords(WS WS "hello" WS WS, true);
EXPECT_EQ(4, v.size());
EXPECT_EQ(WS, v[0]);
EXPECT_EQ(WS, v[1]);
EXPECT_EQ("hello" WS, v[2]);
EXPECT_EQ(WS, v[3]);
}
}
TEST(ModelInterfaceTest, SplitIntoWordsWhiteSpaceOnly) {
{
const auto v =
SplitIntoWords("this" WS "is" WS "a" WS "pen" WS, true, true);
EXPECT_EQ(4, v.size());
EXPECT_EQ("this" WS, v[0]);
EXPECT_EQ("is" WS, v[1]);
EXPECT_EQ("a" WS, v[2]);
EXPECT_EQ("pen" WS, v[3]);
}
{
const auto v = SplitIntoWords(WS WS WS "a", false, true);
EXPECT_EQ(1, v.size());
EXPECT_EQ(WS WS WS "a", v[0]);
}
{
const auto v = SplitIntoWords("a" WS WS WS, true, true);
EXPECT_EQ(1, v.size());
EXPECT_EQ("a" WS WS WS, v[0]);
}
{
const auto v = SplitIntoWords(WS WS, true, true);
EXPECT_EQ(1, v.size());
EXPECT_EQ(WS WS, v[0]);
}
{
const auto v = SplitIntoWords(WS WS "a" WS, true, true);
EXPECT_EQ(2, v.size());
EXPECT_EQ(WS WS, v[0]);
EXPECT_EQ("a" WS, v[1]);
}
{
const auto v = SplitIntoWords(WS WS "a" WS, false, true);
EXPECT_EQ(2, v.size());
EXPECT_EQ(WS WS "a", v[0]);
EXPECT_EQ(WS, v[1]);
}
}
TEST(ModelInterfaceTest, ByteToPieceTest) {
EXPECT_EQ(ByteToPiece(0), "<0x00>");
EXPECT_EQ(ByteToPiece(1), "<0x01>");
EXPECT_EQ(ByteToPiece(10), "<0x0A>");
EXPECT_EQ(ByteToPiece(16), "<0x10>");
EXPECT_EQ(ByteToPiece(255), "<0xFF>");
}
TEST(ModelInterfaceTest, PieceToByteTest) {
// Valid byte pieces.
EXPECT_EQ(PieceToByte("<0x00>"), 0);
EXPECT_EQ(PieceToByte("<0x01>"), 1);
EXPECT_EQ(PieceToByte("<0x0A>"), 10);
EXPECT_EQ(PieceToByte("<0x10>"), 16);
EXPECT_EQ(PieceToByte("<0xFF>"), 255);
// Invalid byte pieces.
EXPECT_EQ(PieceToByte("<0x0>"), -1);
EXPECT_EQ(PieceToByte("<0x000>"), -1);
EXPECT_EQ(PieceToByte("<0x001>"), -1);
EXPECT_EQ(PieceToByte("<0xff>"), -1);
EXPECT_EQ(PieceToByte("<0xFG>"), -1);
EXPECT_EQ(PieceToByte("a"), -1);
}
TEST(ModelInterfaceTest, SetEncoderVersion) {
for (const auto type : kModelTypes) {
ModelProto model_proto = MakeBaseModelProto(type);
AddPiece(&model_proto, "a");
AddPiece(&model_proto, "b");
auto model = ModelFactory::Create(model_proto);
// Verify the default encoder version.
EXPECT_EQ(EncoderVersion::kOptimized, model->GetEncoderVersion());
// Set the encoder version to original and verify.
EXPECT_TRUE(model->SetEncoderVersion(EncoderVersion::kOriginal).ok());
EXPECT_EQ(EncoderVersion::kOriginal, model->GetEncoderVersion());
}
}
TEST(ModelInterfaceTest, VerifyOutputsEquivalent) {
for (const auto type : kModelTypes) {
ModelProto model_proto = MakeBaseModelProto(type);
AddPiece(&model_proto, "a", 1.0);
AddPiece(&model_proto, "b", 2.0);
auto model = ModelFactory::Create(model_proto);
// Equivalent outputs.
EXPECT_TRUE(model->VerifyOutputsEquivalent("", ""));
EXPECT_TRUE(model->VerifyOutputsEquivalent("a b", "a b"));
// Inequivalent outputs.
EXPECT_FALSE(model->VerifyOutputsEquivalent("a", "a b"));
}
}
} // namespace
} // namespace sentencepiece
// sentencepiece-0.1.96/src/init.h
// Copyright 2016 Google Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
#ifndef INIT_H_
#define INIT_H_
#include "common.h"
#include "third_party/absl/flags/flag.h"
#include "third_party/absl/flags/parse.h"
ABSL_DECLARE_FLAG(int32, minloglevel);
namespace sentencepiece {
inline void ParseCommandLineFlags(const char *usage, int *argc, char ***argv,
                                  bool remove_arg = true) {
  // Parse the registered flags; the arguments left over are returned.
  const auto unused_args = absl::ParseCommandLine(*argc, *argv);

  if (remove_arg) {
    // Move the remaining arguments to the tail of argv and shrink argc.
    char **argv_val = *argv;
    *argv = argv_val = argv_val + *argc - unused_args.size();
    std::copy(unused_args.begin(), unused_args.end(), argv_val);
    *argc = static_cast<int>(unused_args.size());
  }
  logging::SetMinLogLevel(absl::GetFlag(FLAGS_minloglevel));
}
} // namespace sentencepiece
#endif // INIT_H_
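The argument compaction in `ParseCommandLineFlags` can be tried in isolation. The sketch below assumes a hypothetical `CollectPositional` helper in place of `absl::ParseCommandLine` (it keeps `argv[0]` and every argument that does not start with `--`). `CompactArgs` and `DemoCompact` are likewise illustrative names, not part of this header.

```cpp
#include <algorithm>
#include <cassert>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical stand-in for the flag parser: keeps argv[0] and every
// argument that does not look like a "--flag".
std::vector<char *> CollectPositional(int argc, char **argv) {
  std::vector<char *> rest;
  for (int i = 0; i < argc; ++i) {
    if (i == 0 || std::strncmp(argv[i], "--", 2) != 0) rest.push_back(argv[i]);
  }
  return rest;
}

// Same pointer arithmetic as the header: write the leftover arguments into
// the tail of the original argv array, then point argv at that tail.
void CompactArgs(int *argc, char ***argv) {
  const std::vector<char *> rest = CollectPositional(*argc, *argv);
  char **tail = *argv + *argc - rest.size();
  std::copy(rest.begin(), rest.end(), tail);
  *argv = tail;
  *argc = static_cast<int>(rest.size());
}

// Convenience wrapper for experimenting with plain strings.
std::vector<std::string> DemoCompact(std::vector<std::string> args) {
  std::vector<char *> ptrs;
  for (auto &a : args) ptrs.push_back(&a[0]);
  int argc = static_cast<int>(ptrs.size());
  char **argv = ptrs.data();
  CompactArgs(&argc, &argv);
  return std::vector<std::string>(argv, argv + argc);
}
```

Because the leftover pointers are first collected into a separate vector, copying them back into the tail of `argv` cannot overwrite an entry that is still to be read.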
// sentencepiece-0.1.96/src/word_model_trainer.cc
// Copyright 2016 Google Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
#include <cmath>
#include <string>
#include "third_party/absl/container/flat_hash_map.h"
#include "third_party/absl/strings/string_view.h"
#include "util.h"
#include "word_model.h"
#include "word_model_trainer.h"
namespace sentencepiece {
namespace word {
util::Status Trainer::Train() {
  RETURN_IF_ERROR(status());

  CHECK_OR_RETURN(normalizer_spec_.escape_whitespaces());
  CHECK_EQ_OR_RETURN(TrainerSpec::WORD, trainer_spec_.model_type());

  RETURN_IF_ERROR(LoadSentences());

  // Accumulate the frequency of every word in the loaded sentences.
  absl::flat_hash_map<std::string, uint64> freq;
  for (const auto &it : sentences_) {
    for (const auto &s : SplitIntoWords(it.first)) {
      freq[std::string(s)] += it.second;
    }
  }

  // Meta pieces occupy part of the requested vocabulary.
  const int vocab_size = trainer_spec_.vocab_size() - meta_pieces_.size();
  CHECK_GE_OR_RETURN(vocab_size, 0);

  uint64 sum = 0;
  for (const auto &it : freq) {
    sum += it.second;
  }

  const auto logsum = std::log(static_cast<float>(sum));

  CHECK_OR_RETURN(final_pieces_.empty());
  // Each piece is scored with its log relative frequency.
  for (const auto &it : Sorted(freq)) {
    if (it.first.find(kUNKStr) != std::string::npos) {
      continue;
    }
    if (!trainer_spec_.use_all_vocab() &&
        final_pieces_.size() == static_cast<size_t>(vocab_size)) {
      break;
    }
    final_pieces_.emplace_back(
        it.first, std::log(static_cast<float>(it.second)) - logsum);
  }

  if (trainer_spec_.use_all_vocab()) {
    trainer_spec_.set_vocab_size(final_pieces_.size() + meta_pieces_.size());
  }

  return Save();
}
} // namespace word
} // namespace sentencepiece
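The scoring step of the word trainer reduces to "count, sort by frequency, keep the top entries, score each as log(freq) - log(sum)". A standalone sketch of that computation is given below, under stated assumptions: `SketchScoreWords` and `WordCount` are hypothetical names, input is already split into words, ties are ordered by key, and the unknown-symbol filter and `use_all_vocab` option are left out.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>
#include <utility>
#include <vector>

using WordCount = std::pair<std::string, std::uint64_t>;

// Hypothetical helper: merges duplicate words, keeps the `vocab_size` most
// frequent ones, and scores each with its log relative frequency.
std::vector<std::pair<std::string, float>> SketchScoreWords(
    const std::vector<WordCount> &counts, std::size_t vocab_size) {
  std::map<std::string, std::uint64_t> freq;
  std::uint64_t sum = 0;
  for (const WordCount &wc : counts) {
    freq[wc.first] += wc.second;
    sum += wc.second;
  }

  // std::map iterates in key order, so a stable sort by descending count
  // leaves ties ordered by key.
  std::vector<WordCount> sorted(freq.begin(), freq.end());
  std::stable_sort(sorted.begin(), sorted.end(),
                   [](const WordCount &a, const WordCount &b) {
                     return a.second > b.second;
                   });

  const float logsum = std::log(static_cast<float>(sum));
  std::vector<std::pair<std::string, float>> pieces;
  for (const WordCount &wc : sorted) {
    if (pieces.size() == vocab_size) break;
    pieces.emplace_back(wc.first,
                        std::log(static_cast<float>(wc.second)) - logsum);
  }
  return pieces;
}
```

Subtracting `logsum` once per piece avoids a division per word and yields scores that are log probabilities over the full word distribution, including the words that were cut off.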
// sentencepiece-0.1.96/src/unicode_script_map.h
// Copyright 2016 Google Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
#ifndef UNICODE_SCRIPT_DATA_H_
#define UNICODE_SCRIPT_DATA_H_
#include "third_party/absl/container/flat_hash_map.h"
namespace sentencepiece {
namespace unicode_script {
namespace {
void InitTable(absl::flat_hash_map<char32, ScriptType> *smap) {
for (char32 c = 0x0000; c <= 0x001F; ++c) (*smap)[c] = U_Common;
(*smap)[0x0020] = U_Common;
for (char32 c = 0x0021; c <= 0x0023; ++c) (*smap)[c] = U_Common;
(*smap)[0x0024] = U_Common;
for (char32 c = 0x0025; c <= 0x0027; ++c) (*smap)[c] = U_Common;
(*smap)[0x0028] = U_Common;
(*smap)[0x0029] = U_Common;
(*smap)[0x002A] = U_Common;
(*smap)[0x002B] = U_Common;
(*smap)[0x002C] = U_Common;
(*smap)[0x002D] = U_Common;
for (char32 c = 0x002E; c <= 0x002F; ++c) (*smap)[c] = U_Common;
for (char32 c = 0x0030; c <= 0x0039; ++c) (*smap)[c] = U_Common;
for (char32 c = 0x003A; c <= 0x003B; ++c) (*smap)[c] = U_Common;
for (char32 c = 0x003C; c <= 0x003E; ++c) (*smap)[c] = U_Common;
for (char32 c = 0x003F; c <= 0x0040; ++c) (*smap)[c] = U_Common;
(*smap)[0x005B] = U_Common;
(*smap)[0x005C] = U_Common;
(*smap)[0x005D] = U_Common;
(*smap)[0x005E] = U_Common;
(*smap)[0x005F] = U_Common;
(*smap)[0x0060] = U_Common;
(*smap)[0x007B] = U_Common;
(*smap)[0x007C] = U_Common;
(*smap)[0x007D] = U_Common;
(*smap)[0x007E] = U_Common;
for (char32 c = 0x007F; c <= 0x009F; ++c) (*smap)[c] = U_Common;
(*smap)[0x00A0] = U_Common;
(*smap)[0x00A1] = U_Common;
for (char32 c = 0x00A2; c <= 0x00A5; ++c) (*smap)[c] = U_Common;
(*smap)[0x00A6] = U_Common;
(*smap)[0x00A7] = U_Common;
(*smap)[0x00A8] = U_Common;
(*smap)[0x00A9] = U_Common;
(*smap)[0x00AB] = U_Common;
(*smap)[0x00AC] = U_Common;
(*smap)[0x00AD] = U_Common;
(*smap)[0x00AE] = U_Common;
(*smap)[0x00AF] = U_Common;
(*smap)[0x00B0] = U_Common;
(*smap)[0x00B1] = U_Common;
for (char32 c = 0x00B2; c <= 0x00B3; ++c) (*smap)[c] = U_Common;
(*smap)[0x00B4] = U_Common;
(*smap)[0x00B5] = U_Common;
for (char32 c = 0x00B6; c <= 0x00B7; ++c) (*smap)[c] = U_Common;
(*smap)[0x00B8] = U_Common;
(*smap)[0x00B9] = U_Common;
(*smap)[0x00BB] = U_Common;
for (char32 c = 0x00BC; c <= 0x00BE; ++c) (*smap)[c] = U_Common;
(*smap)[0x00BF] = U_Common;
(*smap)[0x00D7] = U_Common;
(*smap)[0x00F7] = U_Common;
for (char32 c = 0x02B9; c <= 0x02C1; ++c) (*smap)[c] = U_Common;
for (char32 c = 0x02C2; c <= 0x02C5; ++c) (*smap)[c] = U_Common;
for (char32 c = 0x02C6; c <= 0x02D1; ++c) (*smap)[c] = U_Common;
for (char32 c = 0x02D2; c <= 0x02DF; ++c) (*smap)[c] = U_Common;
for (char32 c = 0x02E5; c <= 0x02E9; ++c) (*smap)[c] = U_Common;
(*smap)[0x02EC] = U_Common;
(*smap)[0x02ED] = U_Common;
(*smap)[0x02EE] = U_Common;
for (char32 c = 0x02EF; c <= 0x02FF; ++c) (*smap)[c] = U_Common;
(*smap)[0x0374] = U_Common;
(*smap)[0x037E] = U_Common;
(*smap)[0x0385] = U_Common;
(*smap)[0x0387] = U_Common;
(*smap)[0x0589] = U_Common;
(*smap)[0x0605] = U_Common;
(*smap)[0x060C] = U_Common;
(*smap)[0x061B] = U_Common;
(*smap)[0x061C] = U_Common;
(*smap)[0x061F] = U_Common;
(*smap)[0x0640] = U_Common;
(*smap)[0x06DD] = U_Common;
(*smap)[0x08E2] = U_Common;
for (char32 c = 0x0964; c <= 0x0965; ++c) (*smap)[c] = U_Common;
(*smap)[0x0E3F] = U_Common;
for (char32 c = 0x0FD5; c <= 0x0FD8; ++c) (*smap)[c] = U_Common;
(*smap)[0x10FB] = U_Common;
for (char32 c = 0x16EB; c <= 0x16ED; ++c) (*smap)[c] = U_Common;
for (char32 c = 0x1735; c <= 0x1736; ++c) (*smap)[c] = U_Common;
for (char32 c = 0x1802; c <= 0x1803; ++c) (*smap)[c] = U_Common;
(*smap)[0x1805] = U_Common;
(*smap)[0x1CD3] = U_Common;
(*smap)[0x1CE1] = U_Common;
for (char32 c = 0x1CE9; c <= 0x1CEC; ++c) (*smap)[c] = U_Common;
for (char32 c = 0x1CEE; c <= 0x1CF1; ++c) (*smap)[c] = U_Common;
for (char32 c = 0x1CF2; c <= 0x1CF3; ++c) (*smap)[c] = U_Common;
for (char32 c = 0x1CF5; c <= 0x1CF6; ++c) (*smap)[c] = U_Common;
for (char32 c = 0x2000; c <= 0x200A; ++c) (*smap)[c] = U_Common;
(*smap)[0x200B] = U_Common;
for (char32 c = 0x200E; c <= 0x200F; ++c) (*smap)[c] = U_Common;
for (char32 c = 0x2010; c <= 0x2015; ++c) (*smap)[c] = U_Common;
for (char32 c = 0x2016; c <= 0x2017; ++c) (*smap)[c] = U_Common;
(*smap)[0x2018] = U_Common;
(*smap)[0x2019] = U_Common;
(*smap)[0x201A] = U_Common;
for (char32 c = 0x201B; c <= 0x201C; ++c) (*smap)[c] = U_Common;
(*smap)[0x201D] = U_Common;
(*smap)[0x201E] = U_Common;
(*smap)[0x201F] = U_Common;
for (char32 c = 0x2020; c <= 0x2027; ++c) (*smap)[c] = U_Common;
(*smap)[0x2028] = U_Common;
(*smap)[0x2029] = U_Common;
for (char32 c = 0x202A; c <= 0x202E; ++c) (*smap)[c] = U_Common;
(*smap)[0x202F] = U_Common;
for (char32 c = 0x2030; c <= 0x2038; ++c) (*smap)[c] = U_Common;
(*smap)[0x2039] = U_Common;
(*smap)[0x203A] = U_Common;
for (char32 c = 0x203B; c <= 0x203E; ++c) (*smap)[c] = U_Common;
for (char32 c = 0x203F; c <= 0x2040; ++c) (*smap)[c] = U_Common;
for (char32 c = 0x2041; c <= 0x2043; ++c) (*smap)[c] = U_Common;
(*smap)[0x2044] = U_Common;
(*smap)[0x2045] = U_Common;
(*smap)[0x2046] = U_Common;
for (char32 c = 0x2047; c <= 0x2051; ++c) (*smap)[c] = U_Common;
(*smap)[0x2052] = U_Common;
(*smap)[0x2053] = U_Common;
(*smap)[0x2054] = U_Common;
for (char32 c = 0x2055; c <= 0x205E; ++c) (*smap)[c] = U_Common;
(*smap)[0x205F] = U_Common;
for (char32 c = 0x2060; c <= 0x2064; ++c) (*smap)[c] = U_Common;
for (char32 c = 0x2066; c <= 0x206F; ++c) (*smap)[c] = U_Common;
(*smap)[0x2070] = U_Common;
for (char32 c = 0x2074; c <= 0x2079; ++c) (*smap)[c] = U_Common;
for (char32 c = 0x207A; c <= 0x207C; ++c) (*smap)[c] = U_Common;
(*smap)[0x207D] = U_Common;
(*smap)[0x207E] = U_Common;
for (char32 c = 0x2080; c <= 0x2089; ++c) (*smap)[c] = U_Common;
for (char32 c = 0x208A; c <= 0x208C; ++c) (*smap)[c] = U_Common;
(*smap)[0x208D] = U_Common;
(*smap)[0x208E] = U_Common;
for (char32 c = 0x20A0; c <= 0x20BE; ++c) (*smap)[c] = U_Common;
for (char32 c = 0x2100; c <= 0x2101; ++c) (*smap)[c] = U_Common;
(*smap)[0x2102] = U_Common;
for (char32 c = 0x2103; c <= 0x2106; ++c) (*smap)[c] = U_Common;
(*smap)[0x2107] = U_Common;
for (char32 c = 0x2108; c <= 0x2109; ++c) (*smap)[c] = U_Common;
for (char32 c = 0x210A; c <= 0x2113; ++c) (*smap)[c] = U_Common;
(*smap)[0x2114] = U_Common;
(*smap)[0x2115] = U_Common;
for (char32 c = 0x2116; c <= 0x2117; ++c) (*smap)[c] = U_Common;
(*smap)[0x2118] = U_Common;
for (char32 c = 0x2119; c <= 0x211D; ++c) (*smap)[c] = U_Common;
for (char32 c = 0x211E; c <= 0x2123; ++c) (*smap)[c] = U_Common;
(*smap)[0x2124] = U_Common;
(*smap)[0x2125] = U_Common;
(*smap)[0x2127] = U_Common;
(*smap)[0x2128] = U_Common;
(*smap)[0x2129] = U_Common;
for (char32 c = 0x212C; c <= 0x212D; ++c) (*smap)[c] = U_Common;
(*smap)[0x212E] = U_Common;
for (char32 c = 0x212F; c <= 0x2131; ++c) (*smap)[c] = U_Common;
for (char32 c = 0x2133; c <= 0x2134; ++c) (*smap)[c] = U_Common;
for (char32 c = 0x2135; c <= 0x2138; ++c) (*smap)[c] = U_Common;
(*smap)[0x2139] = U_Common;
for (char32 c = 0x213A; c <= 0x213B; ++c) (*smap)[c] = U_Common;
for (char32 c = 0x213C; c <= 0x213F; ++c) (*smap)[c] = U_Common;
for (char32 c = 0x2140; c <= 0x2144; ++c) (*smap)[c] = U_Common;
for (char32 c = 0x2145; c <= 0x2149; ++c) (*smap)[c] = U_Common;
(*smap)[0x214A] = U_Common;
(*smap)[0x214B] = U_Common;
for (char32 c = 0x214C; c <= 0x214D; ++c) (*smap)[c] = U_Common;
(*smap)[0x214F] = U_Common;
for (char32 c = 0x2150; c <= 0x215F; ++c) (*smap)[c] = U_Common;
(*smap)[0x2189] = U_Common;
for (char32 c = 0x218A; c <= 0x218B; ++c) (*smap)[c] = U_Common;
for (char32 c = 0x2190; c <= 0x2194; ++c) (*smap)[c] = U_Common;
for (char32 c = 0x2195; c <= 0x2199; ++c) (*smap)[c] = U_Common;
for (char32 c = 0x219A; c <= 0x219B; ++c) (*smap)[c] = U_Common;
for (char32 c = 0x219C; c <= 0x219F; ++c) (*smap)[c] = U_Common;
(*smap)[0x21A0] = U_Common;
for (char32 c = 0x21A1; c <= 0x21A2; ++c) (*smap)[c] = U_Common;
(*smap)[0x21A3] = U_Common;
for (char32 c = 0x21A4; c <= 0x21A5; ++c) (*smap)[c] = U_Common;
(*smap)[0x21A6] = U_Common;
for (char32 c = 0x21A7; c <= 0x21AD; ++c) (*smap)[c] = U_Common;
(*smap)[0x21AE] = U_Common;
for (char32 c = 0x21AF; c <= 0x21CD; ++c) (*smap)[c] = U_Common;
for (char32 c = 0x21CE; c <= 0x21CF; ++c) (*smap)[c] = U_Common;
for (char32 c = 0x21D0; c <= 0x21D1; ++c) (*smap)[c] = U_Common;
(*smap)[0x21D2] = U_Common;
(*smap)[0x21D3] = U_Common;
(*smap)[0x21D4] = U_Common;
for (char32 c = 0x21D5; c <= 0x21F3; ++c) (*smap)[c] = U_Common;
for (char32 c = 0x21F4; c <= 0x22FF; ++c) (*smap)[c] = U_Common;
for (char32 c = 0x2300; c <= 0x2307; ++c) (*smap)[c] = U_Common;
(*smap)[0x2308] = U_Common;
(*smap)[0x2309] = U_Common;
(*smap)[0x230A] = U_Common;
(*smap)[0x230B] = U_Common;
for (char32 c = 0x230C; c <= 0x231F; ++c) (*smap)[c] = U_Common;
for (char32 c = 0x2320; c <= 0x2321; ++c) (*smap)[c] = U_Common;
for (char32 c = 0x2322; c <= 0x2328; ++c) (*smap)[c] = U_Common;
(*smap)[0x2329] = U_Common;
(*smap)[0x232A] = U_Common;
for (char32 c = 0x232B; c <= 0x237B; ++c) (*smap)[c] = U_Common;
(*smap)[0x237C] = U_Common;
for (char32 c = 0x237D; c <= 0x239A; ++c) (*smap)[c] = U_Common;
for (char32 c = 0x239B; c <= 0x23B3; ++c) (*smap)[c] = U_Common;
for (char32 c = 0x23B4; c <= 0x23DB; ++c) (*smap)[c] = U_Common;
for (char32 c = 0x23DC; c <= 0x23E1; ++c) (*smap)[c] = U_Common;
for (char32 c = 0x23E2; c <= 0x23FE; ++c) (*smap)[c] = U_Common;
for (char32 c = 0x2400; c <= 0x2426; ++c) (*smap)[c] = U_Common;
for (char32 c = 0x2440; c <= 0x244A; ++c) (*smap)[c] = U_Common;
for (char32 c = 0x2460; c <= 0x249B; ++c) (*smap)[c] = U_Common;
for (char32 c = 0x249C; c <= 0x24E9; ++c) (*smap)[c] = U_Common;
for (char32 c = 0x24EA; c <= 0x24FF; ++c) (*smap)[c] = U_Common;
for (char32 c = 0x2500; c <= 0x25B6; ++c) (*smap)[c] = U_Common;
(*smap)[0x25B7] = U_Common;
for (char32 c = 0x25B8; c <= 0x25C0; ++c) (*smap)[c] = U_Common;
(*smap)[0x25C1] = U_Common;
for (char32 c = 0x25C2; c <= 0x25F7; ++c) (*smap)[c] = U_Common;
for (char32 c = 0x25F8; c <= 0x25FF; ++c) (*smap)[c] = U_Common;
for (char32 c = 0x2600; c <= 0x266E; ++c) (*smap)[c] = U_Common;
(*smap)[0x266F] = U_Common;
for (char32 c = 0x2670; c <= 0x2767; ++c) (*smap)[c] = U_Common;
(*smap)[0x2768] = U_Common;
(*smap)[0x2769] = U_Common;
(*smap)[0x276A] = U_Common;
(*smap)[0x276B] = U_Common;
(*smap)[0x276C] = U_Common;
(*smap)[0x276D] = U_Common;
(*smap)[0x276E] = U_Common;
(*smap)[0x276F] = U_Common;
(*smap)[0x2770] = U_Common;
(*smap)[0x2771] = U_Common;
(*smap)[0x2772] = U_Common;
(*smap)[0x2773] = U_Common;
(*smap)[0x2774] = U_Common;
(*smap)[0x2775] = U_Common;
for (char32 c = 0x2776; c <= 0x2793; ++c) (*smap)[c] = U_Common;
for (char32 c = 0x2794; c <= 0x27BF; ++c) (*smap)[c] = U_Common;
for (char32 c = 0x27C0; c <= 0x27C4; ++c) (*smap)[c] = U_Common;
(*smap)[0x27C5] = U_Common;
(*smap)[0x27C6] = U_Common;
for (char32 c = 0x27C7; c <= 0x27E5; ++c) (*smap)[c] = U_Common;
(*smap)[0x27E6] = U_Common;
(*smap)[0x27E7] = U_Common;
(*smap)[0x27E8] = U_Common;
(*smap)[0x27E9] = U_Common;
(*smap)[0x27EA] = U_Common;
(*smap)[0x27EB] = U_Common;
(*smap)[0x27EC] = U_Common;
(*smap)[0x27ED] = U_Common;
(*smap)[0x27EE] = U_Common;
(*smap)[0x27EF] = U_Common;
for (char32 c = 0x27F0; c <= 0x27FF; ++c) (*smap)[c] = U_Common;
for (char32 c = 0x2900; c <= 0x2982; ++c) (*smap)[c] = U_Common;
(*smap)[0x2983] = U_Common;
(*smap)[0x2984] = U_Common;
(*smap)[0x2985] = U_Common;
(*smap)[0x2986] = U_Common;
(*smap)[0x2987] = U_Common;
(*smap)[0x2988] = U_Common;
(*smap)[0x2989] = U_Common;
(*smap)[0x298A] = U_Common;
(*smap)[0x298B] = U_Common;
(*smap)[0x298C] = U_Common;
(*smap)[0x298D] = U_Common;
(*smap)[0x298E] = U_Common;
(*smap)[0x298F] = U_Common;
(*smap)[0x2990] = U_Common;
(*smap)[0x2991] = U_Common;
(*smap)[0x2992] = U_Common;
(*smap)[0x2993] = U_Common;
(*smap)[0x2994] = U_Common;
(*smap)[0x2995] = U_Common;
(*smap)[0x2996] = U_Common;
(*smap)[0x2997] = U_Common;
(*smap)[0x2998] = U_Common;
for (char32 c = 0x2999; c <= 0x29D7; ++c) (*smap)[c] = U_Common;
(*smap)[0x29D8] = U_Common;
(*smap)[0x29D9] = U_Common;
(*smap)[0x29DA] = U_Common;
(*smap)[0x29DB] = U_Common;
for (char32 c = 0x29DC; c <= 0x29FB; ++c) (*smap)[c] = U_Common;
(*smap)[0x29FC] = U_Common;
(*smap)[0x29FD] = U_Common;
for (char32 c = 0x29FE; c <= 0x2AFF; ++c) (*smap)[c] = U_Common;
for (char32 c = 0x2B00; c <= 0x2B2F; ++c) (*smap)[c] = U_Common;
for (char32 c = 0x2B30; c <= 0x2B44; ++c) (*smap)[c] = U_Common;
for (char32 c = 0x2B45; c <= 0x2B46; ++c) (*smap)[c] = U_Common;
for (char32 c = 0x2B47; c <= 0x2B4C; ++c) (*smap)[c] = U_Common;
for (char32 c = 0x2B4D; c <= 0x2B73; ++c) (*smap)[c] = U_Common;
for (char32 c = 0x2B76; c <= 0x2B95; ++c) (*smap)[c] = U_Common;
for (char32 c = 0x2B98; c <= 0x2BB9; ++c) (*smap)[c] = U_Common;
for (char32 c = 0x2BBD; c <= 0x2BC8; ++c) (*smap)[c] = U_Common;
for (char32 c = 0x2BCA; c <= 0x2BD1; ++c) (*smap)[c] = U_Common;
for (char32 c = 0x2BEC; c <= 0x2BEF; ++c) (*smap)[c] = U_Common;
for (char32 c = 0x2E00; c <= 0x2E01; ++c) (*smap)[c] = U_Common;
(*smap)[0x2E02] = U_Common;
(*smap)[0x2E03] = U_Common;
(*smap)[0x2E04] = U_Common;
(*smap)[0x2E05] = U_Common;
for (char32 c = 0x2E06; c <= 0x2E08; ++c) (*smap)[c] = U_Common;
(*smap)[0x2E09] = U_Common;
(*smap)[0x2E0A] = U_Common;
(*smap)[0x2E0B] = U_Common;
(*smap)[0x2E0C] = U_Common;
(*smap)[0x2E0D] = U_Common;
for (char32 c = 0x2E0E; c <= 0x2E16; ++c) (*smap)[c] = U_Common;
(*smap)[0x2E17] = U_Common;
for (char32 c = 0x2E18; c <= 0x2E19; ++c) (*smap)[c] = U_Common;
(*smap)[0x2E1A] = U_Common;
(*smap)[0x2E1B] = U_Common;
(*smap)[0x2E1C] = U_Common;
(*smap)[0x2E1D] = U_Common;
for (char32 c = 0x2E1E; c <= 0x2E1F; ++c) (*smap)[c] = U_Common;
(*smap)[0x2E20] = U_Common;
(*smap)[0x2E21] = U_Common;
(*smap)[0x2E22] = U_Common;
(*smap)[0x2E23] = U_Common;
(*smap)[0x2E24] = U_Common;
(*smap)[0x2E25] = U_Common;
(*smap)[0x2E26] = U_Common;
(*smap)[0x2E27] = U_Common;
(*smap)[0x2E28] = U_Common;
(*smap)[0x2E29] = U_Common;
for (char32 c = 0x2E2A; c <= 0x2E2E; ++c) (*smap)[c] = U_Common;
(*smap)[0x2E2F] = U_Common;
for (char32 c = 0x2E30; c <= 0x2E39; ++c) (*smap)[c] = U_Common;
for (char32 c = 0x2E3A; c <= 0x2E3B; ++c) (*smap)[c] = U_Common;
for (char32 c = 0x2E3C; c <= 0x2E3F; ++c) (*smap)[c] = U_Common;
(*smap)[0x2E40] = U_Common;
(*smap)[0x2E41] = U_Common;
(*smap)[0x2E42] = U_Common;
for (char32 c = 0x2E43; c <= 0x2E44; ++c) (*smap)[c] = U_Common;
for (char32 c = 0x2FF0; c <= 0x2FFB; ++c) (*smap)[c] = U_Common;
(*smap)[0x3000] = U_Common;
for (char32 c = 0x3001; c <= 0x3003; ++c) (*smap)[c] = U_Common;
(*smap)[0x3004] = U_Common;
(*smap)[0x3006] = U_Common;
(*smap)[0x3008] = U_Common;
(*smap)[0x3009] = U_Common;
(*smap)[0x300A] = U_Common;
(*smap)[0x300B] = U_Common;
(*smap)[0x300C] = U_Common;
(*smap)[0x300D] = U_Common;
(*smap)[0x300E] = U_Common;
(*smap)[0x300F] = U_Common;
(*smap)[0x3010] = U_Common;
(*smap)[0x3011] = U_Common;
for (char32 c = 0x3012; c <= 0x3013; ++c) (*smap)[c] = U_Common;
(*smap)[0x3014] = U_Common;
(*smap)[0x3015] = U_Common;
(*smap)[0x3016] = U_Common;
(*smap)[0x3017] = U_Common;
(*smap)[0x3018] = U_Common;
(*smap)[0x3019] = U_Common;
(*smap)[0x301A] = U_Common;
(*smap)[0x301B] = U_Common;
(*smap)[0x301C] = U_Common;
(*smap)[0x301D] = U_Common;
for (char32 c = 0x301E; c <= 0x301F; ++c) (*smap)[c] = U_Common;
(*smap)[0x3020] = U_Common;
(*smap)[0x3030] = U_Common;
for (char32 c = 0x3031; c <= 0x3035; ++c) (*smap)[c] = U_Common;
for (char32 c = 0x3036; c <= 0x3037; ++c) (*smap)[c] = U_Common;
(*smap)[0x303C] = U_Common;
(*smap)[0x303D] = U_Common;
for (char32 c = 0x303E; c <= 0x303F; ++c) (*smap)[c] = U_Common;
for (char32 c = 0x309B; c <= 0x309C; ++c) (*smap)[c] = U_Common;
(*smap)[0x30A0] = U_Common;
(*smap)[0x30FB] = U_Common;
(*smap)[0x30FC] = U_Common;
for (char32 c = 0x3190; c <= 0x3191; ++c) (*smap)[c] = U_Common;
for (char32 c = 0x3192; c <= 0x3195; ++c) (*smap)[c] = U_Common;
for (char32 c = 0x3196; c <= 0x319F; ++c) (*smap)[c] = U_Common;
for (char32 c = 0x31C0; c <= 0x31E3; ++c) (*smap)[c] = U_Common;
for (char32 c = 0x3220; c <= 0x3229; ++c) (*smap)[c] = U_Common;
for (char32 c = 0x322A; c <= 0x3247; ++c) (*smap)[c] = U_Common;
for (char32 c = 0x3248; c <= 0x324F; ++c) (*smap)[c] = U_Common;
(*smap)[0x3250] = U_Common;
for (char32 c = 0x3251; c <= 0x325F; ++c) (*smap)[c] = U_Common;
(*smap)[0x327F] = U_Common;
for (char32 c = 0x3280; c <= 0x3289; ++c) (*smap)[c] = U_Common;
for (char32 c = 0x328A; c <= 0x32B0; ++c) (*smap)[c] = U_Common;
for (char32 c = 0x32B1; c <= 0x32BF; ++c) (*smap)[c] = U_Common;
for (char32 c = 0x32C0; c <= 0x32CF; ++c) (*smap)[c] = U_Common;
for (char32 c = 0x3358; c <= 0x33FF; ++c) (*smap)[c] = U_Common;
for (char32 c = 0x4DC0; c <= 0x4DFF; ++c) (*smap)[c] = U_Common;
for (char32 c = 0xA700; c <= 0xA716; ++c) (*smap)[c] = U_Common;
for (char32 c = 0xA717; c <= 0xA71F; ++c) (*smap)[c] = U_Common;
for (char32 c = 0xA720; c <= 0xA721; ++c) (*smap)[c] = U_Common;
(*smap)[0xA788] = U_Common;
for (char32 c = 0xA789; c <= 0xA78A; ++c) (*smap)[c] = U_Common;
for (char32 c = 0xA830; c <= 0xA835; ++c) (*smap)[c] = U_Common;
for (char32 c = 0xA836; c <= 0xA837; ++c) (*smap)[c] = U_Common;
(*smap)[0xA838] = U_Common;
(*smap)[0xA839] = U_Common;
(*smap)[0xA92E] = U_Common;
(*smap)[0xA9CF] = U_Common;
(*smap)[0xAB5B] = U_Common;
(*smap)[0xFD3E] = U_Common;
(*smap)[0xFD3F] = U_Common;
for (char32 c = 0xFE10; c <= 0xFE16; ++c) (*smap)[c] = U_Common;
(*smap)[0xFE17] = U_Common;
(*smap)[0xFE18] = U_Common;
(*smap)[0xFE19] = U_Common;
(*smap)[0xFE30] = U_Common;
for (char32 c = 0xFE31; c <= 0xFE32; ++c) (*smap)[c] = U_Common;
for (char32 c = 0xFE33; c <= 0xFE34; ++c) (*smap)[c] = U_Common;
(*smap)[0xFE35] = U_Common;
(*smap)[0xFE36] = U_Common;
(*smap)[0xFE37] = U_Common;
(*smap)[0xFE38] = U_Common;
(*smap)[0xFE39] = U_Common;
(*smap)[0xFE3A] = U_Common;
(*smap)[0xFE3B] = U_Common;
(*smap)[0xFE3C] = U_Common;
(*smap)[0xFE3D] = U_Common;
(*smap)[0xFE3E] = U_Common;
(*smap)[0xFE3F] = U_Common;
(*smap)[0xFE40] = U_Common;
(*smap)[0xFE41] = U_Common;
(*smap)[0xFE42] = U_Common;
(*smap)[0xFE43] = U_Common;
(*smap)[0xFE44] = U_Common;
for (char32 c = 0xFE45; c <= 0xFE46; ++c) (*smap)[c] = U_Common;
(*smap)[0xFE47] = U_Common;
(*smap)[0xFE48] = U_Common;
for (char32 c = 0xFE49; c <= 0xFE4C; ++c) (*smap)[c] = U_Common;
for (char32 c = 0xFE4D; c <= 0xFE4F; ++c) (*smap)[c] = U_Common;
for (char32 c = 0xFE50; c <= 0xFE52; ++c) (*smap)[c] = U_Common;
for (char32 c = 0xFE54; c <= 0xFE57; ++c) (*smap)[c] = U_Common;
(*smap)[0xFE58] = U_Common;
(*smap)[0xFE59] = U_Common;
(*smap)[0xFE5A] = U_Common;
(*smap)[0xFE5B] = U_Common;
(*smap)[0xFE5C] = U_Common;
(*smap)[0xFE5D] = U_Common;
(*smap)[0xFE5E] = U_Common;
for (char32 c = 0xFE5F; c <= 0xFE61; ++c) (*smap)[c] = U_Common;
(*smap)[0xFE62] = U_Common;
(*smap)[0xFE63] = U_Common;
for (char32 c = 0xFE64; c <= 0xFE66; ++c) (*smap)[c] = U_Common;
(*smap)[0xFE68] = U_Common;
(*smap)[0xFE69] = U_Common;
for (char32 c = 0xFE6A; c <= 0xFE6B; ++c) (*smap)[c] = U_Common;
(*smap)[0xFEFF] = U_Common;
for (char32 c = 0xFF01; c <= 0xFF03; ++c) (*smap)[c] = U_Common;
(*smap)[0xFF04] = U_Common;
for (char32 c = 0xFF05; c <= 0xFF07; ++c) (*smap)[c] = U_Common;
(*smap)[0xFF08] = U_Common;
(*smap)[0xFF09] = U_Common;
(*smap)[0xFF0A] = U_Common;
(*smap)[0xFF0B] = U_Common;
(*smap)[0xFF0C] = U_Common;
(*smap)[0xFF0D] = U_Common;
for (char32 c = 0xFF0E; c <= 0xFF0F; ++c) (*smap)[c] = U_Common;
for (char32 c = 0xFF10; c <= 0xFF19; ++c) (*smap)[c] = U_Common;
for (char32 c = 0xFF1A; c <= 0xFF1B; ++c) (*smap)[c] = U_Common;
for (char32 c = 0xFF1C; c <= 0xFF1E; ++c) (*smap)[c] = U_Common;
for (char32 c = 0xFF1F; c <= 0xFF20; ++c) (*smap)[c] = U_Common;
(*smap)[0xFF3B] = U_Common;
(*smap)[0xFF3C] = U_Common;
(*smap)[0xFF3D] = U_Common;
(*smap)[0xFF3E] = U_Common;
(*smap)[0xFF3F] = U_Common;
(*smap)[0xFF40] = U_Common;
(*smap)[0xFF5B] = U_Common;
(*smap)[0xFF5C] = U_Common;
(*smap)[0xFF5D] = U_Common;
(*smap)[0xFF5E] = U_Common;
(*smap)[0xFF5F] = U_Common;
(*smap)[0xFF60] = U_Common;
(*smap)[0xFF61] = U_Common;
(*smap)[0xFF62] = U_Common;
(*smap)[0xFF63] = U_Common;
for (char32 c = 0xFF64; c <= 0xFF65; ++c) (*smap)[c] = U_Common;
(*smap)[0xFF70] = U_Common;
for (char32 c = 0xFF9E; c <= 0xFF9F; ++c) (*smap)[c] = U_Common;
for (char32 c = 0xFFE0; c <= 0xFFE1; ++c) (*smap)[c] = U_Common;
(*smap)[0xFFE2] = U_Common;
(*smap)[0xFFE3] = U_Common;
(*smap)[0xFFE4] = U_Common;
for (char32 c = 0xFFE5; c <= 0xFFE6; ++c) (*smap)[c] = U_Common;
(*smap)[0xFFE8] = U_Common;
for (char32 c = 0xFFE9; c <= 0xFFEC; ++c) (*smap)[c] = U_Common;
for (char32 c = 0xFFED; c <= 0xFFEE; ++c) (*smap)[c] = U_Common;
for (char32 c = 0xFFF9; c <= 0xFFFB; ++c) (*smap)[c] = U_Common;
for (char32 c = 0xFFFC; c <= 0xFFFD; ++c) (*smap)[c] = U_Common;
for (char32 c = 0x10100; c <= 0x10102; ++c) (*smap)[c] = U_Common;
for (char32 c = 0x10107; c <= 0x10133; ++c) (*smap)[c] = U_Common;
for (char32 c = 0x10137; c <= 0x1013F; ++c) (*smap)[c] = U_Common;
for (char32 c = 0x10190; c <= 0x1019B; ++c) (*smap)[c] = U_Common;
for (char32 c = 0x101D0; c <= 0x101FC; ++c) (*smap)[c] = U_Common;
for (char32 c = 0x102E1; c <= 0x102FB; ++c) (*smap)[c] = U_Common;
for (char32 c = 0x1BCA0; c <= 0x1BCA3; ++c) (*smap)[c] = U_Common;
for (char32 c = 0x1D000; c <= 0x1D0F5; ++c) (*smap)[c] = U_Common;
for (char32 c = 0x1D100; c <= 0x1D126; ++c) (*smap)[c] = U_Common;
for (char32 c = 0x1D129; c <= 0x1D164; ++c) (*smap)[c] = U_Common;
for (char32 c = 0x1D165; c <= 0x1D166; ++c) (*smap)[c] = U_Common;
for (char32 c = 0x1D16A; c <= 0x1D16C; ++c) (*smap)[c] = U_Common;
for (char32 c = 0x1D16D; c <= 0x1D172; ++c) (*smap)[c] = U_Common;
for (char32 c = 0x1D173; c <= 0x1D17A; ++c) (*smap)[c] = U_Common;
for (char32 c = 0x1D183; c <= 0x1D184; ++c) (*smap)[c] = U_Common;
for (char32 c = 0x1D18C; c <= 0x1D1A9; ++c) (*smap)[c] = U_Common;
for (char32 c = 0x1D1AE; c <= 0x1D1E8; ++c) (*smap)[c] = U_Common;
for (char32 c = 0x1D300; c <= 0x1D356; ++c) (*smap)[c] = U_Common;
for (char32 c = 0x1D360; c <= 0x1D371; ++c) (*smap)[c] = U_Common;
for (char32 c = 0x1D400; c <= 0x1D454; ++c) (*smap)[c] = U_Common;
for (char32 c = 0x1D456; c <= 0x1D49C; ++c) (*smap)[c] = U_Common;
for (char32 c = 0x1D49E; c <= 0x1D49F; ++c) (*smap)[c] = U_Common;
(*smap)[0x1D4A2] = U_Common;
for (char32 c = 0x1D4A5; c <= 0x1D4A6; ++c) (*smap)[c] = U_Common;
for (char32 c = 0x1D4A9; c <= 0x1D4AC; ++c) (*smap)[c] = U_Common;
for (char32 c = 0x1D4AE; c <= 0x1D4B9; ++c) (*smap)[c] = U_Common;
(*smap)[0x1D4BB] = U_Common;
for (char32 c = 0x1D4BD; c <= 0x1D4C3; ++c) (*smap)[c] = U_Common;
for (char32 c = 0x1D4C5; c <= 0x1D505; ++c) (*smap)[c] = U_Common;
for (char32 c = 0x1D507; c <= 0x1D50A; ++c) (*smap)[c] = U_Common;
for (char32 c = 0x1D50D; c <= 0x1D514; ++c) (*smap)[c] = U_Common;
for (char32 c = 0x1D516; c <= 0x1D51C; ++c) (*smap)[c] = U_Common;
for (char32 c = 0x1D51E; c <= 0x1D539; ++c) (*smap)[c] = U_Common;
for (char32 c = 0x1D53B; c <= 0x1D53E; ++c) (*smap)[c] = U_Common;
for (char32 c = 0x1D540; c <= 0x1D544; ++c) (*smap)[c] = U_Common;
(*smap)[0x1D546] = U_Common;
for (char32 c = 0x1D54A; c <= 0x1D550; ++c) (*smap)[c] = U_Common;
for (char32 c = 0x1D552; c <= 0x1D6A5; ++c) (*smap)[c] = U_Common;
for (char32 c = 0x1D6A8; c <= 0x1D6C0; ++c) (*smap)[c] = U_Common;
(*smap)[0x1D6C1] = U_Common;
for (char32 c = 0x1D6C2; c <= 0x1D6DA; ++c) (*smap)[c] = U_Common;
(*smap)[0x1D6DB] = U_Common;
for (char32 c = 0x1D6DC; c <= 0x1D6FA; ++c) (*smap)[c] = U_Common;
(*smap)[0x1D6FB] = U_Common;
for (char32 c = 0x1D6FC; c <= 0x1D714; ++c) (*smap)[c] = U_Common;
(*smap)[0x1D715] = U_Common;
for (char32 c = 0x1D716; c <= 0x1D734; ++c) (*smap)[c] = U_Common;
(*smap)[0x1D735] = U_Common;
for (char32 c = 0x1D736; c <= 0x1D74E; ++c) (*smap)[c] = U_Common;
(*smap)[0x1D74F] = U_Common;
for (char32 c = 0x1D750; c <= 0x1D76E; ++c) (*smap)[c] = U_Common;
(*smap)[0x1D76F] = U_Common;
for (char32 c = 0x1D770; c <= 0x1D788; ++c) (*smap)[c] = U_Common;
(*smap)[0x1D789] = U_Common;
for (char32 c = 0x1D78A; c <= 0x1D7A8; ++c) (*smap)[c] = U_Common;
(*smap)[0x1D7A9] = U_Common;
for (char32 c = 0x1D7AA; c <= 0x1D7C2; ++c) (*smap)[c] = U_Common;
(*smap)[0x1D7C3] = U_Common;
for (char32 c = 0x1D7C4; c <= 0x1D7CB; ++c) (*smap)[c] = U_Common;
for (char32 c = 0x1D7CE; c <= 0x1D7FF; ++c) (*smap)[c] = U_Common;
for (char32 c = 0x1F000; c <= 0x1F02B; ++c) (*smap)[c] = U_Common;
for (char32 c = 0x1F030; c <= 0x1F093; ++c) (*smap)[c] = U_Common;
for (char32 c = 0x1F0A0; c <= 0x1F0AE; ++c) (*smap)[c] = U_Common;
for (char32 c = 0x1F0B1; c <= 0x1F0BF; ++c) (*smap)[c] = U_Common;
for (char32 c = 0x1F0C1; c <= 0x1F0CF; ++c) (*smap)[c] = U_Common;
for (char32 c = 0x1F0D1; c <= 0x1F0F5; ++c) (*smap)[c] = U_Common;
for (char32 c = 0x1F100; c <= 0x1F10C; ++c) (*smap)[c] = U_Common;
for (char32 c = 0x1F110; c <= 0x1F12E; ++c) (*smap)[c] = U_Common;
for (char32 c = 0x1F130; c <= 0x1F16B; ++c) (*smap)[c] = U_Common;
for (char32 c = 0x1F170; c <= 0x1F1AC; ++c) (*smap)[c] = U_Common;
for (char32 c = 0x1F1E6; c <= 0x1F1FF; ++c) (*smap)[c] = U_Common;
for (char32 c = 0x1F201; c <= 0x1F202; ++c) (*smap)[c] = U_Common;
for (char32 c = 0x1F210; c <= 0x1F23B; ++c) (*smap)[c] = U_Common;
for (char32 c = 0x1F240; c <= 0x1F248; ++c) (*smap)[c] = U_Common;
for (char32 c = 0x1F250; c <= 0x1F251; ++c) (*smap)[c] = U_Common;
for (char32 c = 0x1F300; c <= 0x1F3FA; ++c) (*smap)[c] = U_Common;
for (char32 c = 0x1F3FB; c <= 0x1F3FF; ++c) (*smap)[c] = U_Common;
for (char32 c = 0x1F400; c <= 0x1F6D2; ++c) (*smap)[c] = U_Common;
for (char32 c = 0x1F6E0; c <= 0x1F6EC; ++c) (*smap)[c] = U_Common;
for (char32 c = 0x1F6F0; c <= 0x1F6F6; ++c) (*smap)[c] = U_Common;
for (char32 c = 0x1F700; c <= 0x1F773; ++c) (*smap)[c] = U_Common;
for (char32 c = 0x1F780; c <= 0x1F7D4; ++c) (*smap)[c] = U_Common;
for (char32 c = 0x1F800; c <= 0x1F80B; ++c) (*smap)[c] = U_Common;
for (char32 c = 0x1F810; c <= 0x1F847; ++c) (*smap)[c] = U_Common;
for (char32 c = 0x1F850; c <= 0x1F859; ++c) (*smap)[c] = U_Common;
for (char32 c = 0x1F860; c <= 0x1F887; ++c) (*smap)[c] = U_Common;
for (char32 c = 0x1F890; c <= 0x1F8AD; ++c) (*smap)[c] = U_Common;
for (char32 c = 0x1F910; c <= 0x1F91E; ++c) (*smap)[c] = U_Common;
for (char32 c = 0x1F920; c <= 0x1F927; ++c) (*smap)[c] = U_Common;
(*smap)[0x1F930] = U_Common;
for (char32 c = 0x1F933; c <= 0x1F93E; ++c) (*smap)[c] = U_Common;
for (char32 c = 0x1F940; c <= 0x1F94B; ++c) (*smap)[c] = U_Common;
for (char32 c = 0x1F950; c <= 0x1F95E; ++c) (*smap)[c] = U_Common;
for (char32 c = 0x1F980; c <= 0x1F991; ++c) (*smap)[c] = U_Common;
(*smap)[0x1F9C0] = U_Common;
(*smap)[0xE0001] = U_Common;
for (char32 c = 0xE0020; c <= 0xE007F; ++c) (*smap)[c] = U_Common;
for (char32 c = 0x0041; c <= 0x005A; ++c) (*smap)[c] = U_Latin;
for (char32 c = 0x0061; c <= 0x007A; ++c) (*smap)[c] = U_Latin;
(*smap)[0x00AA] = U_Latin;
(*smap)[0x00BA] = U_Latin;
for (char32 c = 0x00C0; c <= 0x00D6; ++c) (*smap)[c] = U_Latin;
for (char32 c = 0x00D8; c <= 0x00F6; ++c) (*smap)[c] = U_Latin;
for (char32 c = 0x00F8; c <= 0x01BA; ++c) (*smap)[c] = U_Latin;
(*smap)[0x01BB] = U_Latin;
for (char32 c = 0x01BC; c <= 0x01BF; ++c) (*smap)[c] = U_Latin;
for (char32 c = 0x01C0; c <= 0x01C3; ++c) (*smap)[c] = U_Latin;
for (char32 c = 0x01C4; c <= 0x0293; ++c) (*smap)[c] = U_Latin;
(*smap)[0x0294] = U_Latin;
for (char32 c = 0x0295; c <= 0x02AF; ++c) (*smap)[c] = U_Latin;
for (char32 c = 0x02B0; c <= 0x02B8; ++c) (*smap)[c] = U_Latin;
for (char32 c = 0x02E0; c <= 0x02E4; ++c) (*smap)[c] = U_Latin;
for (char32 c = 0x1D00; c <= 0x1D25; ++c) (*smap)[c] = U_Latin;
for (char32 c = 0x1D2C; c <= 0x1D5C; ++c) (*smap)[c] = U_Latin;
for (char32 c = 0x1D62; c <= 0x1D65; ++c) (*smap)[c] = U_Latin;
for (char32 c = 0x1D6B; c <= 0x1D77; ++c) (*smap)[c] = U_Latin;
for (char32 c = 0x1D79; c <= 0x1D9A; ++c) (*smap)[c] = U_Latin;
for (char32 c = 0x1D9B; c <= 0x1DBE; ++c) (*smap)[c] = U_Latin;
for (char32 c = 0x1E00; c <= 0x1EFF; ++c) (*smap)[c] = U_Latin;
(*smap)[0x2071] = U_Latin;
(*smap)[0x207F] = U_Latin;
for (char32 c = 0x2090; c <= 0x209C; ++c) (*smap)[c] = U_Latin;
for (char32 c = 0x212A; c <= 0x212B; ++c) (*smap)[c] = U_Latin;
(*smap)[0x2132] = U_Latin;
(*smap)[0x214E] = U_Latin;
for (char32 c = 0x2160; c <= 0x2182; ++c) (*smap)[c] = U_Latin;
for (char32 c = 0x2183; c <= 0x2184; ++c) (*smap)[c] = U_Latin;
for (char32 c = 0x2185; c <= 0x2188; ++c) (*smap)[c] = U_Latin;
for (char32 c = 0x2C60; c <= 0x2C7B; ++c) (*smap)[c] = U_Latin;
for (char32 c = 0x2C7C; c <= 0x2C7D; ++c) (*smap)[c] = U_Latin;
for (char32 c = 0x2C7E; c <= 0x2C7F; ++c) (*smap)[c] = U_Latin;
for (char32 c = 0xA722; c <= 0xA76F; ++c) (*smap)[c] = U_Latin;
(*smap)[0xA770] = U_Latin;
for (char32 c = 0xA771; c <= 0xA787; ++c) (*smap)[c] = U_Latin;
for (char32 c = 0xA78B; c <= 0xA78E; ++c) (*smap)[c] = U_Latin;
(*smap)[0xA78F] = U_Latin;
for (char32 c = 0xA790; c <= 0xA7AE; ++c) (*smap)[c] = U_Latin;
for (char32 c = 0xA7B0; c <= 0xA7B7; ++c) (*smap)[c] = U_Latin;
(*smap)[0xA7F7] = U_Latin;
for (char32 c = 0xA7F8; c <= 0xA7F9; ++c) (*smap)[c] = U_Latin;
(*smap)[0xA7FA] = U_Latin;
for (char32 c = 0xA7FB; c <= 0xA7FF; ++c) (*smap)[c] = U_Latin;
for (char32 c = 0xAB30; c <= 0xAB5A; ++c) (*smap)[c] = U_Latin;
for (char32 c = 0xAB5C; c <= 0xAB5F; ++c) (*smap)[c] = U_Latin;
for (char32 c = 0xAB60; c <= 0xAB64; ++c) (*smap)[c] = U_Latin;
for (char32 c = 0xFB00; c <= 0xFB06; ++c) (*smap)[c] = U_Latin;
for (char32 c = 0xFF21; c <= 0xFF3A; ++c) (*smap)[c] = U_Latin;
for (char32 c = 0xFF41; c <= 0xFF5A; ++c) (*smap)[c] = U_Latin;
for (char32 c = 0x0370; c <= 0x0373; ++c) (*smap)[c] = U_Greek;
(*smap)[0x0375] = U_Greek;
for (char32 c = 0x0376; c <= 0x0377; ++c) (*smap)[c] = U_Greek;
(*smap)[0x037A] = U_Greek;
for (char32 c = 0x037B; c <= 0x037D; ++c) (*smap)[c] = U_Greek;
(*smap)[0x037F] = U_Greek;
(*smap)[0x0384] = U_Greek;
(*smap)[0x0386] = U_Greek;
for (char32 c = 0x0388; c <= 0x038A; ++c) (*smap)[c] = U_Greek;
(*smap)[0x038C] = U_Greek;
for (char32 c = 0x038E; c <= 0x03A1; ++c) (*smap)[c] = U_Greek;
for (char32 c = 0x03A3; c <= 0x03E1; ++c) (*smap)[c] = U_Greek;
for (char32 c = 0x03F0; c <= 0x03F5; ++c) (*smap)[c] = U_Greek;
(*smap)[0x03F6] = U_Greek;
for (char32 c = 0x03F7; c <= 0x03FF; ++c) (*smap)[c] = U_Greek;
for (char32 c = 0x1D26; c <= 0x1D2A; ++c) (*smap)[c] = U_Greek;
for (char32 c = 0x1D5D; c <= 0x1D61; ++c) (*smap)[c] = U_Greek;
for (char32 c = 0x1D66; c <= 0x1D6A; ++c) (*smap)[c] = U_Greek;
(*smap)[0x1DBF] = U_Greek;
for (char32 c = 0x1F00; c <= 0x1F15; ++c) (*smap)[c] = U_Greek;
for (char32 c = 0x1F18; c <= 0x1F1D; ++c) (*smap)[c] = U_Greek;
for (char32 c = 0x1F20; c <= 0x1F45; ++c) (*smap)[c] = U_Greek;
for (char32 c = 0x1F48; c <= 0x1F4D; ++c) (*smap)[c] = U_Greek;
for (char32 c = 0x1F50; c <= 0x1F57; ++c) (*smap)[c] = U_Greek;
(*smap)[0x1F59] = U_Greek;
(*smap)[0x1F5B] = U_Greek;
(*smap)[0x1F5D] = U_Greek;
for (char32 c = 0x1F5F; c <= 0x1F7D; ++c) (*smap)[c] = U_Greek;
for (char32 c = 0x1F80; c <= 0x1FB4; ++c) (*smap)[c] = U_Greek;
for (char32 c = 0x1FB6; c <= 0x1FBC; ++c) (*smap)[c] = U_Greek;
(*smap)[0x1FBD] = U_Greek;
(*smap)[0x1FBE] = U_Greek;
for (char32 c = 0x1FBF; c <= 0x1FC1; ++c) (*smap)[c] = U_Greek;
for (char32 c = 0x1FC2; c <= 0x1FC4; ++c) (*smap)[c] = U_Greek;
for (char32 c = 0x1FC6; c <= 0x1FCC; ++c) (*smap)[c] = U_Greek;
for (char32 c = 0x1FCD; c <= 0x1FCF; ++c) (*smap)[c] = U_Greek;
for (char32 c = 0x1FD0; c <= 0x1FD3; ++c) (*smap)[c] = U_Greek;
for (char32 c = 0x1FD6; c <= 0x1FDB; ++c) (*smap)[c] = U_Greek;
for (char32 c = 0x1FDD; c <= 0x1FDF; ++c) (*smap)[c] = U_Greek;
for (char32 c = 0x1FE0; c <= 0x1FEC; ++c) (*smap)[c] = U_Greek;
for (char32 c = 0x1FED; c <= 0x1FEF; ++c) (*smap)[c] = U_Greek;
for (char32 c = 0x1FF2; c <= 0x1FF4; ++c) (*smap)[c] = U_Greek;
for (char32 c = 0x1FF6; c <= 0x1FFC; ++c) (*smap)[c] = U_Greek;
for (char32 c = 0x1FFD; c <= 0x1FFE; ++c) (*smap)[c] = U_Greek;
(*smap)[0x2126] = U_Greek;
(*smap)[0xAB65] = U_Greek;
for (char32 c = 0x10140; c <= 0x10174; ++c) (*smap)[c] = U_Greek;
for (char32 c = 0x10175; c <= 0x10178; ++c) (*smap)[c] = U_Greek;
for (char32 c = 0x10179; c <= 0x10189; ++c) (*smap)[c] = U_Greek;
for (char32 c = 0x1018A; c <= 0x1018B; ++c) (*smap)[c] = U_Greek;
for (char32 c = 0x1018C; c <= 0x1018E; ++c) (*smap)[c] = U_Greek;
(*smap)[0x101A0] = U_Greek;
for (char32 c = 0x1D200; c <= 0x1D241; ++c) (*smap)[c] = U_Greek;
for (char32 c = 0x1D242; c <= 0x1D244; ++c) (*smap)[c] = U_Greek;
(*smap)[0x1D245] = U_Greek;
for (char32 c = 0x0400; c <= 0x0481; ++c) (*smap)[c] = U_Cyrillic;
(*smap)[0x0482] = U_Cyrillic;
for (char32 c = 0x0483; c <= 0x0484; ++c) (*smap)[c] = U_Cyrillic;
(*smap)[0x0487] = U_Cyrillic;
for (char32 c = 0x0488; c <= 0x0489; ++c) (*smap)[c] = U_Cyrillic;
for (char32 c = 0x048A; c <= 0x052F; ++c) (*smap)[c] = U_Cyrillic;
for (char32 c = 0x1C80; c <= 0x1C88; ++c) (*smap)[c] = U_Cyrillic;
(*smap)[0x1D2B] = U_Cyrillic;
(*smap)[0x1D78] = U_Cyrillic;
for (char32 c = 0x2DE0; c <= 0x2DFF; ++c) (*smap)[c] = U_Cyrillic;
for (char32 c = 0xA640; c <= 0xA66D; ++c) (*smap)[c] = U_Cyrillic;
(*smap)[0xA66E] = U_Cyrillic;
(*smap)[0xA66F] = U_Cyrillic;
for (char32 c = 0xA670; c <= 0xA672; ++c) (*smap)[c] = U_Cyrillic;
(*smap)[0xA673] = U_Cyrillic;
for (char32 c = 0xA674; c <= 0xA67D; ++c) (*smap)[c] = U_Cyrillic;
(*smap)[0xA67E] = U_Cyrillic;
(*smap)[0xA67F] = U_Cyrillic;
for (char32 c = 0xA680; c <= 0xA69B; ++c) (*smap)[c] = U_Cyrillic;
for (char32 c = 0xA69C; c <= 0xA69D; ++c) (*smap)[c] = U_Cyrillic;
for (char32 c = 0xA69E; c <= 0xA69F; ++c) (*smap)[c] = U_Cyrillic;
for (char32 c = 0xFE2E; c <= 0xFE2F; ++c) (*smap)[c] = U_Cyrillic;
for (char32 c = 0x0531; c <= 0x0556; ++c) (*smap)[c] = U_Armenian;
(*smap)[0x0559] = U_Armenian;
for (char32 c = 0x055A; c <= 0x055F; ++c) (*smap)[c] = U_Armenian;
for (char32 c = 0x0561; c <= 0x0587; ++c) (*smap)[c] = U_Armenian;
(*smap)[0x058A] = U_Armenian;
for (char32 c = 0x058D; c <= 0x058E; ++c) (*smap)[c] = U_Armenian;
(*smap)[0x058F] = U_Armenian;
for (char32 c = 0xFB13; c <= 0xFB17; ++c) (*smap)[c] = U_Armenian;
for (char32 c = 0x0591; c <= 0x05BD; ++c) (*smap)[c] = U_Hebrew;
(*smap)[0x05BE] = U_Hebrew;
(*smap)[0x05BF] = U_Hebrew;
(*smap)[0x05C0] = U_Hebrew;
for (char32 c = 0x05C1; c <= 0x05C2; ++c) (*smap)[c] = U_Hebrew;
(*smap)[0x05C3] = U_Hebrew;
for (char32 c = 0x05C4; c <= 0x05C5; ++c) (*smap)[c] = U_Hebrew;
(*smap)[0x05C6] = U_Hebrew;
(*smap)[0x05C7] = U_Hebrew;
for (char32 c = 0x05D0; c <= 0x05EA; ++c) (*smap)[c] = U_Hebrew;
for (char32 c = 0x05F0; c <= 0x05F2; ++c) (*smap)[c] = U_Hebrew;
for (char32 c = 0x05F3; c <= 0x05F4; ++c) (*smap)[c] = U_Hebrew;
(*smap)[0xFB1D] = U_Hebrew;
(*smap)[0xFB1E] = U_Hebrew;
for (char32 c = 0xFB1F; c <= 0xFB28; ++c) (*smap)[c] = U_Hebrew;
(*smap)[0xFB29] = U_Hebrew;
for (char32 c = 0xFB2A; c <= 0xFB36; ++c) (*smap)[c] = U_Hebrew;
for (char32 c = 0xFB38; c <= 0xFB3C; ++c) (*smap)[c] = U_Hebrew;
(*smap)[0xFB3E] = U_Hebrew;
for (char32 c = 0xFB40; c <= 0xFB41; ++c) (*smap)[c] = U_Hebrew;
for (char32 c = 0xFB43; c <= 0xFB44; ++c) (*smap)[c] = U_Hebrew;
for (char32 c = 0xFB46; c <= 0xFB4F; ++c) (*smap)[c] = U_Hebrew;
for (char32 c = 0x0600; c <= 0x0604; ++c) (*smap)[c] = U_Arabic;
for (char32 c = 0x0606; c <= 0x0608; ++c) (*smap)[c] = U_Arabic;
for (char32 c = 0x0609; c <= 0x060A; ++c) (*smap)[c] = U_Arabic;
(*smap)[0x060B] = U_Arabic;
(*smap)[0x060D] = U_Arabic;
for (char32 c = 0x060E; c <= 0x060F; ++c) (*smap)[c] = U_Arabic;
for (char32 c = 0x0610; c <= 0x061A; ++c) (*smap)[c] = U_Arabic;
(*smap)[0x061E] = U_Arabic;
for (char32 c = 0x0620; c <= 0x063F; ++c) (*smap)[c] = U_Arabic;
for (char32 c = 0x0641; c <= 0x064A; ++c) (*smap)[c] = U_Arabic;
for (char32 c = 0x0656; c <= 0x065F; ++c) (*smap)[c] = U_Arabic;
for (char32 c = 0x0660; c <= 0x0669; ++c) (*smap)[c] = U_Arabic;
for (char32 c = 0x066A; c <= 0x066D; ++c) (*smap)[c] = U_Arabic;
for (char32 c = 0x066E; c <= 0x066F; ++c) (*smap)[c] = U_Arabic;
for (char32 c = 0x0671; c <= 0x06D3; ++c) (*smap)[c] = U_Arabic;
(*smap)[0x06D4] = U_Arabic;
(*smap)[0x06D5] = U_Arabic;
for (char32 c = 0x06D6; c <= 0x06DC; ++c) (*smap)[c] = U_Arabic;
(*smap)[0x06DE] = U_Arabic;
for (char32 c = 0x06DF; c <= 0x06E4; ++c) (*smap)[c] = U_Arabic;
for (char32 c = 0x06E5; c <= 0x06E6; ++c) (*smap)[c] = U_Arabic;
for (char32 c = 0x06E7; c <= 0x06E8; ++c) (*smap)[c] = U_Arabic;
(*smap)[0x06E9] = U_Arabic;
for (char32 c = 0x06EA; c <= 0x06ED; ++c) (*smap)[c] = U_Arabic;
for (char32 c = 0x06EE; c <= 0x06EF; ++c) (*smap)[c] = U_Arabic;
for (char32 c = 0x06F0; c <= 0x06F9; ++c) (*smap)[c] = U_Arabic;
for (char32 c = 0x06FA; c <= 0x06FC; ++c) (*smap)[c] = U_Arabic;
for (char32 c = 0x06FD; c <= 0x06FE; ++c) (*smap)[c] = U_Arabic;
(*smap)[0x06FF] = U_Arabic;
for (char32 c = 0x0750; c <= 0x077F; ++c) (*smap)[c] = U_Arabic;
for (char32 c = 0x08A0; c <= 0x08B4; ++c) (*smap)[c] = U_Arabic;
for (char32 c = 0x08B6; c <= 0x08BD; ++c) (*smap)[c] = U_Arabic;
for (char32 c = 0x08D4; c <= 0x08E1; ++c) (*smap)[c] = U_Arabic;
for (char32 c = 0x08E3; c <= 0x08FF; ++c) (*smap)[c] = U_Arabic;
for (char32 c = 0xFB50; c <= 0xFBB1; ++c) (*smap)[c] = U_Arabic;
for (char32 c = 0xFBB2; c <= 0xFBC1; ++c) (*smap)[c] = U_Arabic;
for (char32 c = 0xFBD3; c <= 0xFD3D; ++c) (*smap)[c] = U_Arabic;
for (char32 c = 0xFD50; c <= 0xFD8F; ++c) (*smap)[c] = U_Arabic;
for (char32 c = 0xFD92; c <= 0xFDC7; ++c) (*smap)[c] = U_Arabic;
for (char32 c = 0xFDF0; c <= 0xFDFB; ++c) (*smap)[c] = U_Arabic;
(*smap)[0xFDFC] = U_Arabic;
(*smap)[0xFDFD] = U_Arabic;
for (char32 c = 0xFE70; c <= 0xFE74; ++c) (*smap)[c] = U_Arabic;
for (char32 c = 0xFE76; c <= 0xFEFC; ++c) (*smap)[c] = U_Arabic;
for (char32 c = 0x10E60; c <= 0x10E7E; ++c) (*smap)[c] = U_Arabic;
for (char32 c = 0x1EE00; c <= 0x1EE03; ++c) (*smap)[c] = U_Arabic;
for (char32 c = 0x1EE05; c <= 0x1EE1F; ++c) (*smap)[c] = U_Arabic;
for (char32 c = 0x1EE21; c <= 0x1EE22; ++c) (*smap)[c] = U_Arabic;
(*smap)[0x1EE24] = U_Arabic;
(*smap)[0x1EE27] = U_Arabic;
for (char32 c = 0x1EE29; c <= 0x1EE32; ++c) (*smap)[c] = U_Arabic;
for (char32 c = 0x1EE34; c <= 0x1EE37; ++c) (*smap)[c] = U_Arabic;
(*smap)[0x1EE39] = U_Arabic;
(*smap)[0x1EE3B] = U_Arabic;
(*smap)[0x1EE42] = U_Arabic;
(*smap)[0x1EE47] = U_Arabic;
(*smap)[0x1EE49] = U_Arabic;
(*smap)[0x1EE4B] = U_Arabic;
for (char32 c = 0x1EE4D; c <= 0x1EE4F; ++c) (*smap)[c] = U_Arabic;
for (char32 c = 0x1EE51; c <= 0x1EE52; ++c) (*smap)[c] = U_Arabic;
(*smap)[0x1EE54] = U_Arabic;
(*smap)[0x1EE57] = U_Arabic;
(*smap)[0x1EE59] = U_Arabic;
(*smap)[0x1EE5B] = U_Arabic;
(*smap)[0x1EE5D] = U_Arabic;
(*smap)[0x1EE5F] = U_Arabic;
for (char32 c = 0x1EE61; c <= 0x1EE62; ++c) (*smap)[c] = U_Arabic;
(*smap)[0x1EE64] = U_Arabic;
for (char32 c = 0x1EE67; c <= 0x1EE6A; ++c) (*smap)[c] = U_Arabic;
for (char32 c = 0x1EE6C; c <= 0x1EE72; ++c) (*smap)[c] = U_Arabic;
for (char32 c = 0x1EE74; c <= 0x1EE77; ++c) (*smap)[c] = U_Arabic;
for (char32 c = 0x1EE79; c <= 0x1EE7C; ++c) (*smap)[c] = U_Arabic;
(*smap)[0x1EE7E] = U_Arabic;
for (char32 c = 0x1EE80; c <= 0x1EE89; ++c) (*smap)[c] = U_Arabic;
for (char32 c = 0x1EE8B; c <= 0x1EE9B; ++c) (*smap)[c] = U_Arabic;
for (char32 c = 0x1EEA1; c <= 0x1EEA3; ++c) (*smap)[c] = U_Arabic;
for (char32 c = 0x1EEA5; c <= 0x1EEA9; ++c) (*smap)[c] = U_Arabic;
for (char32 c = 0x1EEAB; c <= 0x1EEBB; ++c) (*smap)[c] = U_Arabic;
for (char32 c = 0x1EEF0; c <= 0x1EEF1; ++c) (*smap)[c] = U_Arabic;
for (char32 c = 0x0700; c <= 0x070D; ++c) (*smap)[c] = U_Syriac;
(*smap)[0x070F] = U_Syriac;
(*smap)[0x0710] = U_Syriac;
(*smap)[0x0711] = U_Syriac;
for (char32 c = 0x0712; c <= 0x072F; ++c) (*smap)[c] = U_Syriac;
for (char32 c = 0x0730; c <= 0x074A; ++c) (*smap)[c] = U_Syriac;
for (char32 c = 0x074D; c <= 0x074F; ++c) (*smap)[c] = U_Syriac;
for (char32 c = 0x0780; c <= 0x07A5; ++c) (*smap)[c] = U_Thaana;
for (char32 c = 0x07A6; c <= 0x07B0; ++c) (*smap)[c] = U_Thaana;
(*smap)[0x07B1] = U_Thaana;
for (char32 c = 0x0900; c <= 0x0902; ++c) (*smap)[c] = U_Devanagari;
(*smap)[0x0903] = U_Devanagari;
for (char32 c = 0x0904; c <= 0x0939; ++c) (*smap)[c] = U_Devanagari;
(*smap)[0x093A] = U_Devanagari;
(*smap)[0x093B] = U_Devanagari;
(*smap)[0x093C] = U_Devanagari;
(*smap)[0x093D] = U_Devanagari;
for (char32 c = 0x093E; c <= 0x0940; ++c) (*smap)[c] = U_Devanagari;
for (char32 c = 0x0941; c <= 0x0948; ++c) (*smap)[c] = U_Devanagari;
for (char32 c = 0x0949; c <= 0x094C; ++c) (*smap)[c] = U_Devanagari;
(*smap)[0x094D] = U_Devanagari;
for (char32 c = 0x094E; c <= 0x094F; ++c) (*smap)[c] = U_Devanagari;
(*smap)[0x0950] = U_Devanagari;
for (char32 c = 0x0953; c <= 0x0957; ++c) (*smap)[c] = U_Devanagari;
for (char32 c = 0x0958; c <= 0x0961; ++c) (*smap)[c] = U_Devanagari;
for (char32 c = 0x0962; c <= 0x0963; ++c) (*smap)[c] = U_Devanagari;
for (char32 c = 0x0966; c <= 0x096F; ++c) (*smap)[c] = U_Devanagari;
(*smap)[0x0970] = U_Devanagari;
(*smap)[0x0971] = U_Devanagari;
for (char32 c = 0x0972; c <= 0x097F; ++c) (*smap)[c] = U_Devanagari;
for (char32 c = 0xA8E0; c <= 0xA8F1; ++c) (*smap)[c] = U_Devanagari;
for (char32 c = 0xA8F2; c <= 0xA8F7; ++c) (*smap)[c] = U_Devanagari;
for (char32 c = 0xA8F8; c <= 0xA8FA; ++c) (*smap)[c] = U_Devanagari;
(*smap)[0xA8FB] = U_Devanagari;
(*smap)[0xA8FC] = U_Devanagari;
(*smap)[0xA8FD] = U_Devanagari;
(*smap)[0x0980] = U_Bengali;
(*smap)[0x0981] = U_Bengali;
for (char32 c = 0x0982; c <= 0x0983; ++c) (*smap)[c] = U_Bengali;
for (char32 c = 0x0985; c <= 0x098C; ++c) (*smap)[c] = U_Bengali;
for (char32 c = 0x098F; c <= 0x0990; ++c) (*smap)[c] = U_Bengali;
for (char32 c = 0x0993; c <= 0x09A8; ++c) (*smap)[c] = U_Bengali;
for (char32 c = 0x09AA; c <= 0x09B0; ++c) (*smap)[c] = U_Bengali;
(*smap)[0x09B2] = U_Bengali;
for (char32 c = 0x09B6; c <= 0x09B9; ++c) (*smap)[c] = U_Bengali;
(*smap)[0x09BC] = U_Bengali;
(*smap)[0x09BD] = U_Bengali;
for (char32 c = 0x09BE; c <= 0x09C0; ++c) (*smap)[c] = U_Bengali;
for (char32 c = 0x09C1; c <= 0x09C4; ++c) (*smap)[c] = U_Bengali;
for (char32 c = 0x09C7; c <= 0x09C8; ++c) (*smap)[c] = U_Bengali;
for (char32 c = 0x09CB; c <= 0x09CC; ++c) (*smap)[c] = U_Bengali;
(*smap)[0x09CD] = U_Bengali;
(*smap)[0x09CE] = U_Bengali;
(*smap)[0x09D7] = U_Bengali;
for (char32 c = 0x09DC; c <= 0x09DD; ++c) (*smap)[c] = U_Bengali;
for (char32 c = 0x09DF; c <= 0x09E1; ++c) (*smap)[c] = U_Bengali;
for (char32 c = 0x09E2; c <= 0x09E3; ++c) (*smap)[c] = U_Bengali;
for (char32 c = 0x09E6; c <= 0x09EF; ++c) (*smap)[c] = U_Bengali;
for (char32 c = 0x09F0; c <= 0x09F1; ++c) (*smap)[c] = U_Bengali;
for (char32 c = 0x09F2; c <= 0x09F3; ++c) (*smap)[c] = U_Bengali;
for (char32 c = 0x09F4; c <= 0x09F9; ++c) (*smap)[c] = U_Bengali;
(*smap)[0x09FA] = U_Bengali;
(*smap)[0x09FB] = U_Bengali;
for (char32 c = 0x0A01; c <= 0x0A02; ++c) (*smap)[c] = U_Gurmukhi;
(*smap)[0x0A03] = U_Gurmukhi;
for (char32 c = 0x0A05; c <= 0x0A0A; ++c) (*smap)[c] = U_Gurmukhi;
for (char32 c = 0x0A0F; c <= 0x0A10; ++c) (*smap)[c] = U_Gurmukhi;
for (char32 c = 0x0A13; c <= 0x0A28; ++c) (*smap)[c] = U_Gurmukhi;
for (char32 c = 0x0A2A; c <= 0x0A30; ++c) (*smap)[c] = U_Gurmukhi;
for (char32 c = 0x0A32; c <= 0x0A33; ++c) (*smap)[c] = U_Gurmukhi;
for (char32 c = 0x0A35; c <= 0x0A36; ++c) (*smap)[c] = U_Gurmukhi;
for (char32 c = 0x0A38; c <= 0x0A39; ++c) (*smap)[c] = U_Gurmukhi;
(*smap)[0x0A3C] = U_Gurmukhi;
for (char32 c = 0x0A3E; c <= 0x0A40; ++c) (*smap)[c] = U_Gurmukhi;
for (char32 c = 0x0A41; c <= 0x0A42; ++c) (*smap)[c] = U_Gurmukhi;
for (char32 c = 0x0A47; c <= 0x0A48; ++c) (*smap)[c] = U_Gurmukhi;
for (char32 c = 0x0A4B; c <= 0x0A4D; ++c) (*smap)[c] = U_Gurmukhi;
(*smap)[0x0A51] = U_Gurmukhi;
for (char32 c = 0x0A59; c <= 0x0A5C; ++c) (*smap)[c] = U_Gurmukhi;
(*smap)[0x0A5E] = U_Gurmukhi;
for (char32 c = 0x0A66; c <= 0x0A6F; ++c) (*smap)[c] = U_Gurmukhi;
for (char32 c = 0x0A70; c <= 0x0A71; ++c) (*smap)[c] = U_Gurmukhi;
for (char32 c = 0x0A72; c <= 0x0A74; ++c) (*smap)[c] = U_Gurmukhi;
(*smap)[0x0A75] = U_Gurmukhi;
for (char32 c = 0x0A81; c <= 0x0A82; ++c) (*smap)[c] = U_Gujarati;
(*smap)[0x0A83] = U_Gujarati;
for (char32 c = 0x0A85; c <= 0x0A8D; ++c) (*smap)[c] = U_Gujarati;
for (char32 c = 0x0A8F; c <= 0x0A91; ++c) (*smap)[c] = U_Gujarati;
for (char32 c = 0x0A93; c <= 0x0AA8; ++c) (*smap)[c] = U_Gujarati;
for (char32 c = 0x0AAA; c <= 0x0AB0; ++c) (*smap)[c] = U_Gujarati;
for (char32 c = 0x0AB2; c <= 0x0AB3; ++c) (*smap)[c] = U_Gujarati;
for (char32 c = 0x0AB5; c <= 0x0AB9; ++c) (*smap)[c] = U_Gujarati;
(*smap)[0x0ABC] = U_Gujarati;
(*smap)[0x0ABD] = U_Gujarati;
for (char32 c = 0x0ABE; c <= 0x0AC0; ++c) (*smap)[c] = U_Gujarati;
for (char32 c = 0x0AC1; c <= 0x0AC5; ++c) (*smap)[c] = U_Gujarati;
for (char32 c = 0x0AC7; c <= 0x0AC8; ++c) (*smap)[c] = U_Gujarati;
(*smap)[0x0AC9] = U_Gujarati;
for (char32 c = 0x0ACB; c <= 0x0ACC; ++c) (*smap)[c] = U_Gujarati;
(*smap)[0x0ACD] = U_Gujarati;
(*smap)[0x0AD0] = U_Gujarati;
for (char32 c = 0x0AE0; c <= 0x0AE1; ++c) (*smap)[c] = U_Gujarati;
for (char32 c = 0x0AE2; c <= 0x0AE3; ++c) (*smap)[c] = U_Gujarati;
for (char32 c = 0x0AE6; c <= 0x0AEF; ++c) (*smap)[c] = U_Gujarati;
(*smap)[0x0AF0] = U_Gujarati;
(*smap)[0x0AF1] = U_Gujarati;
(*smap)[0x0AF9] = U_Gujarati;
(*smap)[0x0B01] = U_Oriya;
for (char32 c = 0x0B02; c <= 0x0B03; ++c) (*smap)[c] = U_Oriya;
for (char32 c = 0x0B05; c <= 0x0B0C; ++c) (*smap)[c] = U_Oriya;
for (char32 c = 0x0B0F; c <= 0x0B10; ++c) (*smap)[c] = U_Oriya;
for (char32 c = 0x0B13; c <= 0x0B28; ++c) (*smap)[c] = U_Oriya;
for (char32 c = 0x0B2A; c <= 0x0B30; ++c) (*smap)[c] = U_Oriya;
for (char32 c = 0x0B32; c <= 0x0B33; ++c) (*smap)[c] = U_Oriya;
for (char32 c = 0x0B35; c <= 0x0B39; ++c) (*smap)[c] = U_Oriya;
(*smap)[0x0B3C] = U_Oriya;
(*smap)[0x0B3D] = U_Oriya;
(*smap)[0x0B3E] = U_Oriya;
(*smap)[0x0B3F] = U_Oriya;
(*smap)[0x0B40] = U_Oriya;
for (char32 c = 0x0B41; c <= 0x0B44; ++c) (*smap)[c] = U_Oriya;
for (char32 c = 0x0B47; c <= 0x0B48; ++c) (*smap)[c] = U_Oriya;
for (char32 c = 0x0B4B; c <= 0x0B4C; ++c) (*smap)[c] = U_Oriya;
(*smap)[0x0B4D] = U_Oriya;
(*smap)[0x0B56] = U_Oriya;
(*smap)[0x0B57] = U_Oriya;
for (char32 c = 0x0B5C; c <= 0x0B5D; ++c) (*smap)[c] = U_Oriya;
for (char32 c = 0x0B5F; c <= 0x0B61; ++c) (*smap)[c] = U_Oriya;
for (char32 c = 0x0B62; c <= 0x0B63; ++c) (*smap)[c] = U_Oriya;
for (char32 c = 0x0B66; c <= 0x0B6F; ++c) (*smap)[c] = U_Oriya;
(*smap)[0x0B70] = U_Oriya;
(*smap)[0x0B71] = U_Oriya;
for (char32 c = 0x0B72; c <= 0x0B77; ++c) (*smap)[c] = U_Oriya;
(*smap)[0x0B82] = U_Tamil;
(*smap)[0x0B83] = U_Tamil;
for (char32 c = 0x0B85; c <= 0x0B8A; ++c) (*smap)[c] = U_Tamil;
for (char32 c = 0x0B8E; c <= 0x0B90; ++c) (*smap)[c] = U_Tamil;
for (char32 c = 0x0B92; c <= 0x0B95; ++c) (*smap)[c] = U_Tamil;
for (char32 c = 0x0B99; c <= 0x0B9A; ++c) (*smap)[c] = U_Tamil;
(*smap)[0x0B9C] = U_Tamil;
for (char32 c = 0x0B9E; c <= 0x0B9F; ++c) (*smap)[c] = U_Tamil;
for (char32 c = 0x0BA3; c <= 0x0BA4; ++c) (*smap)[c] = U_Tamil;
for (char32 c = 0x0BA8; c <= 0x0BAA; ++c) (*smap)[c] = U_Tamil;
for (char32 c = 0x0BAE; c <= 0x0BB9; ++c) (*smap)[c] = U_Tamil;
for (char32 c = 0x0BBE; c <= 0x0BBF; ++c) (*smap)[c] = U_Tamil;
(*smap)[0x0BC0] = U_Tamil;
for (char32 c = 0x0BC1; c <= 0x0BC2; ++c) (*smap)[c] = U_Tamil;
for (char32 c = 0x0BC6; c <= 0x0BC8; ++c) (*smap)[c] = U_Tamil;
for (char32 c = 0x0BCA; c <= 0x0BCC; ++c) (*smap)[c] = U_Tamil;
(*smap)[0x0BCD] = U_Tamil;
(*smap)[0x0BD0] = U_Tamil;
(*smap)[0x0BD7] = U_Tamil;
for (char32 c = 0x0BE6; c <= 0x0BEF; ++c) (*smap)[c] = U_Tamil;
for (char32 c = 0x0BF0; c <= 0x0BF2; ++c) (*smap)[c] = U_Tamil;
for (char32 c = 0x0BF3; c <= 0x0BF8; ++c) (*smap)[c] = U_Tamil;
(*smap)[0x0BF9] = U_Tamil;
(*smap)[0x0BFA] = U_Tamil;
(*smap)[0x0C00] = U_Telugu;
for (char32 c = 0x0C01; c <= 0x0C03; ++c) (*smap)[c] = U_Telugu;
for (char32 c = 0x0C05; c <= 0x0C0C; ++c) (*smap)[c] = U_Telugu;
for (char32 c = 0x0C0E; c <= 0x0C10; ++c) (*smap)[c] = U_Telugu;
for (char32 c = 0x0C12; c <= 0x0C28; ++c) (*smap)[c] = U_Telugu;
for (char32 c = 0x0C2A; c <= 0x0C39; ++c) (*smap)[c] = U_Telugu;
(*smap)[0x0C3D] = U_Telugu;
for (char32 c = 0x0C3E; c <= 0x0C40; ++c) (*smap)[c] = U_Telugu;
for (char32 c = 0x0C41; c <= 0x0C44; ++c) (*smap)[c] = U_Telugu;
for (char32 c = 0x0C46; c <= 0x0C48; ++c) (*smap)[c] = U_Telugu;
for (char32 c = 0x0C4A; c <= 0x0C4D; ++c) (*smap)[c] = U_Telugu;
for (char32 c = 0x0C55; c <= 0x0C56; ++c) (*smap)[c] = U_Telugu;
for (char32 c = 0x0C58; c <= 0x0C5A; ++c) (*smap)[c] = U_Telugu;
for (char32 c = 0x0C60; c <= 0x0C61; ++c) (*smap)[c] = U_Telugu;
for (char32 c = 0x0C62; c <= 0x0C63; ++c) (*smap)[c] = U_Telugu;
for (char32 c = 0x0C66; c <= 0x0C6F; ++c) (*smap)[c] = U_Telugu;
for (char32 c = 0x0C78; c <= 0x0C7E; ++c) (*smap)[c] = U_Telugu;
(*smap)[0x0C7F] = U_Telugu;
(*smap)[0x0C80] = U_Kannada;
(*smap)[0x0C81] = U_Kannada;
for (char32 c = 0x0C82; c <= 0x0C83; ++c) (*smap)[c] = U_Kannada;
for (char32 c = 0x0C85; c <= 0x0C8C; ++c) (*smap)[c] = U_Kannada;
for (char32 c = 0x0C8E; c <= 0x0C90; ++c) (*smap)[c] = U_Kannada;
for (char32 c = 0x0C92; c <= 0x0CA8; ++c) (*smap)[c] = U_Kannada;
for (char32 c = 0x0CAA; c <= 0x0CB3; ++c) (*smap)[c] = U_Kannada;
for (char32 c = 0x0CB5; c <= 0x0CB9; ++c) (*smap)[c] = U_Kannada;
(*smap)[0x0CBC] = U_Kannada;
(*smap)[0x0CBD] = U_Kannada;
(*smap)[0x0CBE] = U_Kannada;
(*smap)[0x0CBF] = U_Kannada;
for (char32 c = 0x0CC0; c <= 0x0CC4; ++c) (*smap)[c] = U_Kannada;
(*smap)[0x0CC6] = U_Kannada;
for (char32 c = 0x0CC7; c <= 0x0CC8; ++c) (*smap)[c] = U_Kannada;
for (char32 c = 0x0CCA; c <= 0x0CCB; ++c) (*smap)[c] = U_Kannada;
for (char32 c = 0x0CCC; c <= 0x0CCD; ++c) (*smap)[c] = U_Kannada;
for (char32 c = 0x0CD5; c <= 0x0CD6; ++c) (*smap)[c] = U_Kannada;
(*smap)[0x0CDE] = U_Kannada;
for (char32 c = 0x0CE0; c <= 0x0CE1; ++c) (*smap)[c] = U_Kannada;
for (char32 c = 0x0CE2; c <= 0x0CE3; ++c) (*smap)[c] = U_Kannada;
for (char32 c = 0x0CE6; c <= 0x0CEF; ++c) (*smap)[c] = U_Kannada;
for (char32 c = 0x0CF1; c <= 0x0CF2; ++c) (*smap)[c] = U_Kannada;
(*smap)[0x0D01] = U_Malayalam;
for (char32 c = 0x0D02; c <= 0x0D03; ++c) (*smap)[c] = U_Malayalam;
for (char32 c = 0x0D05; c <= 0x0D0C; ++c) (*smap)[c] = U_Malayalam;
for (char32 c = 0x0D0E; c <= 0x0D10; ++c) (*smap)[c] = U_Malayalam;
for (char32 c = 0x0D12; c <= 0x0D3A; ++c) (*smap)[c] = U_Malayalam;
(*smap)[0x0D3D] = U_Malayalam;
for (char32 c = 0x0D3E; c <= 0x0D40; ++c) (*smap)[c] = U_Malayalam;
for (char32 c = 0x0D41; c <= 0x0D44; ++c) (*smap)[c] = U_Malayalam;
for (char32 c = 0x0D46; c <= 0x0D48; ++c) (*smap)[c] = U_Malayalam;
for (char32 c = 0x0D4A; c <= 0x0D4C; ++c) (*smap)[c] = U_Malayalam;
(*smap)[0x0D4D] = U_Malayalam;
(*smap)[0x0D4E] = U_Malayalam;
(*smap)[0x0D4F] = U_Malayalam;
for (char32 c = 0x0D54; c <= 0x0D56; ++c) (*smap)[c] = U_Malayalam;
(*smap)[0x0D57] = U_Malayalam;
for (char32 c = 0x0D58; c <= 0x0D5E; ++c) (*smap)[c] = U_Malayalam;
for (char32 c = 0x0D5F; c <= 0x0D61; ++c) (*smap)[c] = U_Malayalam;
for (char32 c = 0x0D62; c <= 0x0D63; ++c) (*smap)[c] = U_Malayalam;
for (char32 c = 0x0D66; c <= 0x0D6F; ++c) (*smap)[c] = U_Malayalam;
for (char32 c = 0x0D70; c <= 0x0D78; ++c) (*smap)[c] = U_Malayalam;
(*smap)[0x0D79] = U_Malayalam;
for (char32 c = 0x0D7A; c <= 0x0D7F; ++c) (*smap)[c] = U_Malayalam;
for (char32 c = 0x0D82; c <= 0x0D83; ++c) (*smap)[c] = U_Sinhala;
for (char32 c = 0x0D85; c <= 0x0D96; ++c) (*smap)[c] = U_Sinhala;
for (char32 c = 0x0D9A; c <= 0x0DB1; ++c) (*smap)[c] = U_Sinhala;
for (char32 c = 0x0DB3; c <= 0x0DBB; ++c) (*smap)[c] = U_Sinhala;
(*smap)[0x0DBD] = U_Sinhala;
for (char32 c = 0x0DC0; c <= 0x0DC6; ++c) (*smap)[c] = U_Sinhala;
(*smap)[0x0DCA] = U_Sinhala;
for (char32 c = 0x0DCF; c <= 0x0DD1; ++c) (*smap)[c] = U_Sinhala;
for (char32 c = 0x0DD2; c <= 0x0DD4; ++c) (*smap)[c] = U_Sinhala;
(*smap)[0x0DD6] = U_Sinhala;
for (char32 c = 0x0DD8; c <= 0x0DDF; ++c) (*smap)[c] = U_Sinhala;
for (char32 c = 0x0DE6; c <= 0x0DEF; ++c) (*smap)[c] = U_Sinhala;
for (char32 c = 0x0DF2; c <= 0x0DF3; ++c) (*smap)[c] = U_Sinhala;
(*smap)[0x0DF4] = U_Sinhala;
for (char32 c = 0x111E1; c <= 0x111F4; ++c) (*smap)[c] = U_Sinhala;
for (char32 c = 0x0E01; c <= 0x0E30; ++c) (*smap)[c] = U_Thai;
(*smap)[0x0E31] = U_Thai;
for (char32 c = 0x0E32; c <= 0x0E33; ++c) (*smap)[c] = U_Thai;
for (char32 c = 0x0E34; c <= 0x0E3A; ++c) (*smap)[c] = U_Thai;
for (char32 c = 0x0E40; c <= 0x0E45; ++c) (*smap)[c] = U_Thai;
(*smap)[0x0E46] = U_Thai;
for (char32 c = 0x0E47; c <= 0x0E4E; ++c) (*smap)[c] = U_Thai;
(*smap)[0x0E4F] = U_Thai;
for (char32 c = 0x0E50; c <= 0x0E59; ++c) (*smap)[c] = U_Thai;
for (char32 c = 0x0E5A; c <= 0x0E5B; ++c) (*smap)[c] = U_Thai;
for (char32 c = 0x0E81; c <= 0x0E82; ++c) (*smap)[c] = U_Lao;
(*smap)[0x0E84] = U_Lao;
for (char32 c = 0x0E87; c <= 0x0E88; ++c) (*smap)[c] = U_Lao;
(*smap)[0x0E8A] = U_Lao;
(*smap)[0x0E8D] = U_Lao;
for (char32 c = 0x0E94; c <= 0x0E97; ++c) (*smap)[c] = U_Lao;
for (char32 c = 0x0E99; c <= 0x0E9F; ++c) (*smap)[c] = U_Lao;
for (char32 c = 0x0EA1; c <= 0x0EA3; ++c) (*smap)[c] = U_Lao;
(*smap)[0x0EA5] = U_Lao;
(*smap)[0x0EA7] = U_Lao;
for (char32 c = 0x0EAA; c <= 0x0EAB; ++c) (*smap)[c] = U_Lao;
for (char32 c = 0x0EAD; c <= 0x0EB0; ++c) (*smap)[c] = U_Lao;
(*smap)[0x0EB1] = U_Lao;
for (char32 c = 0x0EB2; c <= 0x0EB3; ++c) (*smap)[c] = U_Lao;
for (char32 c = 0x0EB4; c <= 0x0EB9; ++c) (*smap)[c] = U_Lao;
for (char32 c = 0x0EBB; c <= 0x0EBC; ++c) (*smap)[c] = U_Lao;
(*smap)[0x0EBD] = U_Lao;
for (char32 c = 0x0EC0; c <= 0x0EC4; ++c) (*smap)[c] = U_Lao;
(*smap)[0x0EC6] = U_Lao;
for (char32 c = 0x0EC8; c <= 0x0ECD; ++c) (*smap)[c] = U_Lao;
for (char32 c = 0x0ED0; c <= 0x0ED9; ++c) (*smap)[c] = U_Lao;
for (char32 c = 0x0EDC; c <= 0x0EDF; ++c) (*smap)[c] = U_Lao;
(*smap)[0x0F00] = U_Tibetan;
for (char32 c = 0x0F01; c <= 0x0F03; ++c) (*smap)[c] = U_Tibetan;
for (char32 c = 0x0F04; c <= 0x0F12; ++c) (*smap)[c] = U_Tibetan;
(*smap)[0x0F13] = U_Tibetan;
(*smap)[0x0F14] = U_Tibetan;
for (char32 c = 0x0F15; c <= 0x0F17; ++c) (*smap)[c] = U_Tibetan;
for (char32 c = 0x0F18; c <= 0x0F19; ++c) (*smap)[c] = U_Tibetan;
for (char32 c = 0x0F1A; c <= 0x0F1F; ++c) (*smap)[c] = U_Tibetan;
for (char32 c = 0x0F20; c <= 0x0F29; ++c) (*smap)[c] = U_Tibetan;
for (char32 c = 0x0F2A; c <= 0x0F33; ++c) (*smap)[c] = U_Tibetan;
(*smap)[0x0F34] = U_Tibetan;
(*smap)[0x0F35] = U_Tibetan;
(*smap)[0x0F36] = U_Tibetan;
(*smap)[0x0F37] = U_Tibetan;
(*smap)[0x0F38] = U_Tibetan;
(*smap)[0x0F39] = U_Tibetan;
(*smap)[0x0F3A] = U_Tibetan;
(*smap)[0x0F3B] = U_Tibetan;
(*smap)[0x0F3C] = U_Tibetan;
(*smap)[0x0F3D] = U_Tibetan;
for (char32 c = 0x0F3E; c <= 0x0F3F; ++c) (*smap)[c] = U_Tibetan;
for (char32 c = 0x0F40; c <= 0x0F47; ++c) (*smap)[c] = U_Tibetan;
for (char32 c = 0x0F49; c <= 0x0F6C; ++c) (*smap)[c] = U_Tibetan;
for (char32 c = 0x0F71; c <= 0x0F7E; ++c) (*smap)[c] = U_Tibetan;
(*smap)[0x0F7F] = U_Tibetan;
for (char32 c = 0x0F80; c <= 0x0F84; ++c) (*smap)[c] = U_Tibetan;
(*smap)[0x0F85] = U_Tibetan;
for (char32 c = 0x0F86; c <= 0x0F87; ++c) (*smap)[c] = U_Tibetan;
for (char32 c = 0x0F88; c <= 0x0F8C; ++c) (*smap)[c] = U_Tibetan;
for (char32 c = 0x0F8D; c <= 0x0F97; ++c) (*smap)[c] = U_Tibetan;
for (char32 c = 0x0F99; c <= 0x0FBC; ++c) (*smap)[c] = U_Tibetan;
for (char32 c = 0x0FBE; c <= 0x0FC5; ++c) (*smap)[c] = U_Tibetan;
(*smap)[0x0FC6] = U_Tibetan;
for (char32 c = 0x0FC7; c <= 0x0FCC; ++c) (*smap)[c] = U_Tibetan;
for (char32 c = 0x0FCE; c <= 0x0FCF; ++c) (*smap)[c] = U_Tibetan;
for (char32 c = 0x0FD0; c <= 0x0FD4; ++c) (*smap)[c] = U_Tibetan;
for (char32 c = 0x0FD9; c <= 0x0FDA; ++c) (*smap)[c] = U_Tibetan;
for (char32 c = 0x1000; c <= 0x102A; ++c) (*smap)[c] = U_Myanmar;
for (char32 c = 0x102B; c <= 0x102C; ++c) (*smap)[c] = U_Myanmar;
for (char32 c = 0x102D; c <= 0x1030; ++c) (*smap)[c] = U_Myanmar;
(*smap)[0x1031] = U_Myanmar;
for (char32 c = 0x1032; c <= 0x1037; ++c) (*smap)[c] = U_Myanmar;
(*smap)[0x1038] = U_Myanmar;
for (char32 c = 0x1039; c <= 0x103A; ++c) (*smap)[c] = U_Myanmar;
for (char32 c = 0x103B; c <= 0x103C; ++c) (*smap)[c] = U_Myanmar;
for (char32 c = 0x103D; c <= 0x103E; ++c) (*smap)[c] = U_Myanmar;
(*smap)[0x103F] = U_Myanmar;
for (char32 c = 0x1040; c <= 0x1049; ++c) (*smap)[c] = U_Myanmar;
for (char32 c = 0x104A; c <= 0x104F; ++c) (*smap)[c] = U_Myanmar;
for (char32 c = 0x1050; c <= 0x1055; ++c) (*smap)[c] = U_Myanmar;
for (char32 c = 0x1056; c <= 0x1057; ++c) (*smap)[c] = U_Myanmar;
for (char32 c = 0x1058; c <= 0x1059; ++c) (*smap)[c] = U_Myanmar;
for (char32 c = 0x105A; c <= 0x105D; ++c) (*smap)[c] = U_Myanmar;
for (char32 c = 0x105E; c <= 0x1060; ++c) (*smap)[c] = U_Myanmar;
(*smap)[0x1061] = U_Myanmar;
for (char32 c = 0x1062; c <= 0x1064; ++c) (*smap)[c] = U_Myanmar;
for (char32 c = 0x1065; c <= 0x1066; ++c) (*smap)[c] = U_Myanmar;
for (char32 c = 0x1067; c <= 0x106D; ++c) (*smap)[c] = U_Myanmar;
for (char32 c = 0x106E; c <= 0x1070; ++c) (*smap)[c] = U_Myanmar;
for (char32 c = 0x1071; c <= 0x1074; ++c) (*smap)[c] = U_Myanmar;
for (char32 c = 0x1075; c <= 0x1081; ++c) (*smap)[c] = U_Myanmar;
(*smap)[0x1082] = U_Myanmar;
for (char32 c = 0x1083; c <= 0x1084; ++c) (*smap)[c] = U_Myanmar;
for (char32 c = 0x1085; c <= 0x1086; ++c) (*smap)[c] = U_Myanmar;
for (char32 c = 0x1087; c <= 0x108C; ++c) (*smap)[c] = U_Myanmar;
(*smap)[0x108D] = U_Myanmar;
(*smap)[0x108E] = U_Myanmar;
(*smap)[0x108F] = U_Myanmar;
for (char32 c = 0x1090; c <= 0x1099; ++c) (*smap)[c] = U_Myanmar;
for (char32 c = 0x109A; c <= 0x109C; ++c) (*smap)[c] = U_Myanmar;
(*smap)[0x109D] = U_Myanmar;
for (char32 c = 0x109E; c <= 0x109F; ++c) (*smap)[c] = U_Myanmar;
for (char32 c = 0xA9E0; c <= 0xA9E4; ++c) (*smap)[c] = U_Myanmar;
(*smap)[0xA9E5] = U_Myanmar;
(*smap)[0xA9E6] = U_Myanmar;
for (char32 c = 0xA9E7; c <= 0xA9EF; ++c) (*smap)[c] = U_Myanmar;
for (char32 c = 0xA9F0; c <= 0xA9F9; ++c) (*smap)[c] = U_Myanmar;
for (char32 c = 0xA9FA; c <= 0xA9FE; ++c) (*smap)[c] = U_Myanmar;
for (char32 c = 0xAA60; c <= 0xAA6F; ++c) (*smap)[c] = U_Myanmar;
(*smap)[0xAA70] = U_Myanmar;
for (char32 c = 0xAA71; c <= 0xAA76; ++c) (*smap)[c] = U_Myanmar;
for (char32 c = 0xAA77; c <= 0xAA79; ++c) (*smap)[c] = U_Myanmar;
(*smap)[0xAA7A] = U_Myanmar;
(*smap)[0xAA7B] = U_Myanmar;
(*smap)[0xAA7C] = U_Myanmar;
(*smap)[0xAA7D] = U_Myanmar;
for (char32 c = 0xAA7E; c <= 0xAA7F; ++c) (*smap)[c] = U_Myanmar;
for (char32 c = 0x10A0; c <= 0x10C5; ++c) (*smap)[c] = U_Georgian;
(*smap)[0x10C7] = U_Georgian;
(*smap)[0x10CD] = U_Georgian;
for (char32 c = 0x10D0; c <= 0x10FA; ++c) (*smap)[c] = U_Georgian;
(*smap)[0x10FC] = U_Georgian;
for (char32 c = 0x10FD; c <= 0x10FF; ++c) (*smap)[c] = U_Georgian;
for (char32 c = 0x2D00; c <= 0x2D25; ++c) (*smap)[c] = U_Georgian;
(*smap)[0x2D27] = U_Georgian;
(*smap)[0x2D2D] = U_Georgian;
for (char32 c = 0x1100; c <= 0x11FF; ++c) (*smap)[c] = U_Hangul;
for (char32 c = 0x302E; c <= 0x302F; ++c) (*smap)[c] = U_Hangul;
for (char32 c = 0x3131; c <= 0x318E; ++c) (*smap)[c] = U_Hangul;
for (char32 c = 0x3200; c <= 0x321E; ++c) (*smap)[c] = U_Hangul;
for (char32 c = 0x3260; c <= 0x327E; ++c) (*smap)[c] = U_Hangul;
for (char32 c = 0xA960; c <= 0xA97C; ++c) (*smap)[c] = U_Hangul;
for (char32 c = 0xAC00; c <= 0xD7A3; ++c) (*smap)[c] = U_Hangul;
for (char32 c = 0xD7B0; c <= 0xD7C6; ++c) (*smap)[c] = U_Hangul;
for (char32 c = 0xD7CB; c <= 0xD7FB; ++c) (*smap)[c] = U_Hangul;
for (char32 c = 0xFFA0; c <= 0xFFBE; ++c) (*smap)[c] = U_Hangul;
for (char32 c = 0xFFC2; c <= 0xFFC7; ++c) (*smap)[c] = U_Hangul;
for (char32 c = 0xFFCA; c <= 0xFFCF; ++c) (*smap)[c] = U_Hangul;
for (char32 c = 0xFFD2; c <= 0xFFD7; ++c) (*smap)[c] = U_Hangul;
for (char32 c = 0xFFDA; c <= 0xFFDC; ++c) (*smap)[c] = U_Hangul;
for (char32 c = 0x1200; c <= 0x1248; ++c) (*smap)[c] = U_Ethiopic;
for (char32 c = 0x124A; c <= 0x124D; ++c) (*smap)[c] = U_Ethiopic;
for (char32 c = 0x1250; c <= 0x1256; ++c) (*smap)[c] = U_Ethiopic;
(*smap)[0x1258] = U_Ethiopic;
for (char32 c = 0x125A; c <= 0x125D; ++c) (*smap)[c] = U_Ethiopic;
for (char32 c = 0x1260; c <= 0x1288; ++c) (*smap)[c] = U_Ethiopic;
for (char32 c = 0x128A; c <= 0x128D; ++c) (*smap)[c] = U_Ethiopic;
for (char32 c = 0x1290; c <= 0x12B0; ++c) (*smap)[c] = U_Ethiopic;
for (char32 c = 0x12B2; c <= 0x12B5; ++c) (*smap)[c] = U_Ethiopic;
for (char32 c = 0x12B8; c <= 0x12BE; ++c) (*smap)[c] = U_Ethiopic;
(*smap)[0x12C0] = U_Ethiopic;
for (char32 c = 0x12C2; c <= 0x12C5; ++c) (*smap)[c] = U_Ethiopic;
for (char32 c = 0x12C8; c <= 0x12D6; ++c) (*smap)[c] = U_Ethiopic;
for (char32 c = 0x12D8; c <= 0x1310; ++c) (*smap)[c] = U_Ethiopic;
for (char32 c = 0x1312; c <= 0x1315; ++c) (*smap)[c] = U_Ethiopic;
for (char32 c = 0x1318; c <= 0x135A; ++c) (*smap)[c] = U_Ethiopic;
for (char32 c = 0x135D; c <= 0x135F; ++c) (*smap)[c] = U_Ethiopic;
for (char32 c = 0x1360; c <= 0x1368; ++c) (*smap)[c] = U_Ethiopic;
for (char32 c = 0x1369; c <= 0x137C; ++c) (*smap)[c] = U_Ethiopic;
for (char32 c = 0x1380; c <= 0x138F; ++c) (*smap)[c] = U_Ethiopic;
for (char32 c = 0x1390; c <= 0x1399; ++c) (*smap)[c] = U_Ethiopic;
for (char32 c = 0x2D80; c <= 0x2D96; ++c) (*smap)[c] = U_Ethiopic;
for (char32 c = 0x2DA0; c <= 0x2DA6; ++c) (*smap)[c] = U_Ethiopic;
for (char32 c = 0x2DA8; c <= 0x2DAE; ++c) (*smap)[c] = U_Ethiopic;
for (char32 c = 0x2DB0; c <= 0x2DB6; ++c) (*smap)[c] = U_Ethiopic;
for (char32 c = 0x2DB8; c <= 0x2DBE; ++c) (*smap)[c] = U_Ethiopic;
for (char32 c = 0x2DC0; c <= 0x2DC6; ++c) (*smap)[c] = U_Ethiopic;
for (char32 c = 0x2DC8; c <= 0x2DCE; ++c) (*smap)[c] = U_Ethiopic;
for (char32 c = 0x2DD0; c <= 0x2DD6; ++c) (*smap)[c] = U_Ethiopic;
for (char32 c = 0x2DD8; c <= 0x2DDE; ++c) (*smap)[c] = U_Ethiopic;
for (char32 c = 0xAB01; c <= 0xAB06; ++c) (*smap)[c] = U_Ethiopic;
for (char32 c = 0xAB09; c <= 0xAB0E; ++c) (*smap)[c] = U_Ethiopic;
for (char32 c = 0xAB11; c <= 0xAB16; ++c) (*smap)[c] = U_Ethiopic;
for (char32 c = 0xAB20; c <= 0xAB26; ++c) (*smap)[c] = U_Ethiopic;
for (char32 c = 0xAB28; c <= 0xAB2E; ++c) (*smap)[c] = U_Ethiopic;
for (char32 c = 0x13A0; c <= 0x13F5; ++c) (*smap)[c] = U_Cherokee;
for (char32 c = 0x13F8; c <= 0x13FD; ++c) (*smap)[c] = U_Cherokee;
for (char32 c = 0xAB70; c <= 0xABBF; ++c) (*smap)[c] = U_Cherokee;
(*smap)[0x1400] = U_Canadian_Aboriginal;
for (char32 c = 0x1401; c <= 0x166C; ++c) (*smap)[c] = U_Canadian_Aboriginal;
for (char32 c = 0x166D; c <= 0x166E; ++c) (*smap)[c] = U_Canadian_Aboriginal;
for (char32 c = 0x166F; c <= 0x167F; ++c) (*smap)[c] = U_Canadian_Aboriginal;
for (char32 c = 0x18B0; c <= 0x18F5; ++c) (*smap)[c] = U_Canadian_Aboriginal;
(*smap)[0x1680] = U_Ogham;
for (char32 c = 0x1681; c <= 0x169A; ++c) (*smap)[c] = U_Ogham;
(*smap)[0x169B] = U_Ogham;
(*smap)[0x169C] = U_Ogham;
for (char32 c = 0x16A0; c <= 0x16EA; ++c) (*smap)[c] = U_Runic;
for (char32 c = 0x16EE; c <= 0x16F0; ++c) (*smap)[c] = U_Runic;
for (char32 c = 0x16F1; c <= 0x16F8; ++c) (*smap)[c] = U_Runic;
for (char32 c = 0x1780; c <= 0x17B3; ++c) (*smap)[c] = U_Khmer;
for (char32 c = 0x17B4; c <= 0x17B5; ++c) (*smap)[c] = U_Khmer;
(*smap)[0x17B6] = U_Khmer;
for (char32 c = 0x17B7; c <= 0x17BD; ++c) (*smap)[c] = U_Khmer;
for (char32 c = 0x17BE; c <= 0x17C5; ++c) (*smap)[c] = U_Khmer;
(*smap)[0x17C6] = U_Khmer;
for (char32 c = 0x17C7; c <= 0x17C8; ++c) (*smap)[c] = U_Khmer;
for (char32 c = 0x17C9; c <= 0x17D3; ++c) (*smap)[c] = U_Khmer;
for (char32 c = 0x17D4; c <= 0x17D6; ++c) (*smap)[c] = U_Khmer;
(*smap)[0x17D7] = U_Khmer;
for (char32 c = 0x17D8; c <= 0x17DA; ++c) (*smap)[c] = U_Khmer;
(*smap)[0x17DB] = U_Khmer;
(*smap)[0x17DC] = U_Khmer;
(*smap)[0x17DD] = U_Khmer;
for (char32 c = 0x17E0; c <= 0x17E9; ++c) (*smap)[c] = U_Khmer;
for (char32 c = 0x17F0; c <= 0x17F9; ++c) (*smap)[c] = U_Khmer;
for (char32 c = 0x19E0; c <= 0x19FF; ++c) (*smap)[c] = U_Khmer;
for (char32 c = 0x1800; c <= 0x1801; ++c) (*smap)[c] = U_Mongolian;
(*smap)[0x1804] = U_Mongolian;
(*smap)[0x1806] = U_Mongolian;
for (char32 c = 0x1807; c <= 0x180A; ++c) (*smap)[c] = U_Mongolian;
for (char32 c = 0x180B; c <= 0x180D; ++c) (*smap)[c] = U_Mongolian;
(*smap)[0x180E] = U_Mongolian;
for (char32 c = 0x1810; c <= 0x1819; ++c) (*smap)[c] = U_Mongolian;
for (char32 c = 0x1820; c <= 0x1842; ++c) (*smap)[c] = U_Mongolian;
(*smap)[0x1843] = U_Mongolian;
for (char32 c = 0x1844; c <= 0x1877; ++c) (*smap)[c] = U_Mongolian;
for (char32 c = 0x1880; c <= 0x1884; ++c) (*smap)[c] = U_Mongolian;
for (char32 c = 0x1885; c <= 0x1886; ++c) (*smap)[c] = U_Mongolian;
for (char32 c = 0x1887; c <= 0x18A8; ++c) (*smap)[c] = U_Mongolian;
(*smap)[0x18A9] = U_Mongolian;
(*smap)[0x18AA] = U_Mongolian;
for (char32 c = 0x11660; c <= 0x1166C; ++c) (*smap)[c] = U_Mongolian;
for (char32 c = 0x3041; c <= 0x3096; ++c) (*smap)[c] = U_Hiragana;
for (char32 c = 0x309D; c <= 0x309E; ++c) (*smap)[c] = U_Hiragana;
(*smap)[0x309F] = U_Hiragana;
(*smap)[0x1B001] = U_Hiragana;
(*smap)[0x1F200] = U_Hiragana;
for (char32 c = 0x30A1; c <= 0x30FA; ++c) (*smap)[c] = U_Katakana;
for (char32 c = 0x30FD; c <= 0x30FE; ++c) (*smap)[c] = U_Katakana;
(*smap)[0x30FF] = U_Katakana;
for (char32 c = 0x31F0; c <= 0x31FF; ++c) (*smap)[c] = U_Katakana;
for (char32 c = 0x32D0; c <= 0x32FE; ++c) (*smap)[c] = U_Katakana;
for (char32 c = 0x3300; c <= 0x3357; ++c) (*smap)[c] = U_Katakana;
for (char32 c = 0xFF66; c <= 0xFF6F; ++c) (*smap)[c] = U_Katakana;
for (char32 c = 0xFF71; c <= 0xFF9D; ++c) (*smap)[c] = U_Katakana;
(*smap)[0x1B000] = U_Katakana;
for (char32 c = 0x02EA; c <= 0x02EB; ++c) (*smap)[c] = U_Bopomofo;
for (char32 c = 0x3105; c <= 0x312D; ++c) (*smap)[c] = U_Bopomofo;
for (char32 c = 0x31A0; c <= 0x31BA; ++c) (*smap)[c] = U_Bopomofo;
for (char32 c = 0x2E80; c <= 0x2E99; ++c) (*smap)[c] = U_Han;
for (char32 c = 0x2E9B; c <= 0x2EF3; ++c) (*smap)[c] = U_Han;
for (char32 c = 0x2F00; c <= 0x2FD5; ++c) (*smap)[c] = U_Han;
(*smap)[0x3005] = U_Han;
(*smap)[0x3007] = U_Han;
for (char32 c = 0x3021; c <= 0x3029; ++c) (*smap)[c] = U_Han;
for (char32 c = 0x3038; c <= 0x303A; ++c) (*smap)[c] = U_Han;
(*smap)[0x303B] = U_Han;
for (char32 c = 0x3400; c <= 0x4DB5; ++c) (*smap)[c] = U_Han;
for (char32 c = 0x4E00; c <= 0x9FD5; ++c) (*smap)[c] = U_Han;
for (char32 c = 0xF900; c <= 0xFA6D; ++c) (*smap)[c] = U_Han;
for (char32 c = 0xFA70; c <= 0xFAD9; ++c) (*smap)[c] = U_Han;
for (char32 c = 0x20000; c <= 0x2A6D6; ++c) (*smap)[c] = U_Han;
for (char32 c = 0x2A700; c <= 0x2B734; ++c) (*smap)[c] = U_Han;
for (char32 c = 0x2B740; c <= 0x2B81D; ++c) (*smap)[c] = U_Han;
for (char32 c = 0x2B820; c <= 0x2CEA1; ++c) (*smap)[c] = U_Han;
for (char32 c = 0x2F800; c <= 0x2FA1D; ++c) (*smap)[c] = U_Han;
for (char32 c = 0xA000; c <= 0xA014; ++c) (*smap)[c] = U_Yi;
(*smap)[0xA015] = U_Yi;
for (char32 c = 0xA016; c <= 0xA48C; ++c) (*smap)[c] = U_Yi;
for (char32 c = 0xA490; c <= 0xA4C6; ++c) (*smap)[c] = U_Yi;
for (char32 c = 0x10300; c <= 0x1031F; ++c) (*smap)[c] = U_Old_Italic;
for (char32 c = 0x10320; c <= 0x10323; ++c) (*smap)[c] = U_Old_Italic;
for (char32 c = 0x10330; c <= 0x10340; ++c) (*smap)[c] = U_Gothic;
(*smap)[0x10341] = U_Gothic;
for (char32 c = 0x10342; c <= 0x10349; ++c) (*smap)[c] = U_Gothic;
(*smap)[0x1034A] = U_Gothic;
for (char32 c = 0x10400; c <= 0x1044F; ++c) (*smap)[c] = U_Deseret;
for (char32 c = 0x0300; c <= 0x036F; ++c) (*smap)[c] = U_Inherited;
for (char32 c = 0x0485; c <= 0x0486; ++c) (*smap)[c] = U_Inherited;
for (char32 c = 0x064B; c <= 0x0655; ++c) (*smap)[c] = U_Inherited;
(*smap)[0x0670] = U_Inherited;
for (char32 c = 0x0951; c <= 0x0952; ++c) (*smap)[c] = U_Inherited;
for (char32 c = 0x1AB0; c <= 0x1ABD; ++c) (*smap)[c] = U_Inherited;
(*smap)[0x1ABE] = U_Inherited;
for (char32 c = 0x1CD0; c <= 0x1CD2; ++c) (*smap)[c] = U_Inherited;
for (char32 c = 0x1CD4; c <= 0x1CE0; ++c) (*smap)[c] = U_Inherited;
for (char32 c = 0x1CE2; c <= 0x1CE8; ++c) (*smap)[c] = U_Inherited;
(*smap)[0x1CED] = U_Inherited;
(*smap)[0x1CF4] = U_Inherited;
for (char32 c = 0x1CF8; c <= 0x1CF9; ++c) (*smap)[c] = U_Inherited;
for (char32 c = 0x1DC0; c <= 0x1DF5; ++c) (*smap)[c] = U_Inherited;
for (char32 c = 0x1DFB; c <= 0x1DFF; ++c) (*smap)[c] = U_Inherited;
for (char32 c = 0x200C; c <= 0x200D; ++c) (*smap)[c] = U_Inherited;
for (char32 c = 0x20D0; c <= 0x20DC; ++c) (*smap)[c] = U_Inherited;
for (char32 c = 0x20DD; c <= 0x20E0; ++c) (*smap)[c] = U_Inherited;
(*smap)[0x20E1] = U_Inherited;
for (char32 c = 0x20E2; c <= 0x20E4; ++c) (*smap)[c] = U_Inherited;
for (char32 c = 0x20E5; c <= 0x20F0; ++c) (*smap)[c] = U_Inherited;
for (char32 c = 0x302A; c <= 0x302D; ++c) (*smap)[c] = U_Inherited;
for (char32 c = 0x3099; c <= 0x309A; ++c) (*smap)[c] = U_Inherited;
for (char32 c = 0xFE00; c <= 0xFE0F; ++c) (*smap)[c] = U_Inherited;
for (char32 c = 0xFE20; c <= 0xFE2D; ++c) (*smap)[c] = U_Inherited;
(*smap)[0x101FD] = U_Inherited;
(*smap)[0x102E0] = U_Inherited;
for (char32 c = 0x1D167; c <= 0x1D169; ++c) (*smap)[c] = U_Inherited;
for (char32 c = 0x1D17B; c <= 0x1D182; ++c) (*smap)[c] = U_Inherited;
for (char32 c = 0x1D185; c <= 0x1D18B; ++c) (*smap)[c] = U_Inherited;
for (char32 c = 0x1D1AA; c <= 0x1D1AD; ++c) (*smap)[c] = U_Inherited;
for (char32 c = 0xE0100; c <= 0xE01EF; ++c) (*smap)[c] = U_Inherited;
for (char32 c = 0x1700; c <= 0x170C; ++c) (*smap)[c] = U_Tagalog;
for (char32 c = 0x170E; c <= 0x1711; ++c) (*smap)[c] = U_Tagalog;
for (char32 c = 0x1712; c <= 0x1714; ++c) (*smap)[c] = U_Tagalog;
for (char32 c = 0x1720; c <= 0x1731; ++c) (*smap)[c] = U_Hanunoo;
for (char32 c = 0x1732; c <= 0x1734; ++c) (*smap)[c] = U_Hanunoo;
for (char32 c = 0x1740; c <= 0x1751; ++c) (*smap)[c] = U_Buhid;
for (char32 c = 0x1752; c <= 0x1753; ++c) (*smap)[c] = U_Buhid;
for (char32 c = 0x1760; c <= 0x176C; ++c) (*smap)[c] = U_Tagbanwa;
for (char32 c = 0x176E; c <= 0x1770; ++c) (*smap)[c] = U_Tagbanwa;
for (char32 c = 0x1772; c <= 0x1773; ++c) (*smap)[c] = U_Tagbanwa;
for (char32 c = 0x1900; c <= 0x191E; ++c) (*smap)[c] = U_Limbu;
for (char32 c = 0x1920; c <= 0x1922; ++c) (*smap)[c] = U_Limbu;
for (char32 c = 0x1923; c <= 0x1926; ++c) (*smap)[c] = U_Limbu;
for (char32 c = 0x1927; c <= 0x1928; ++c) (*smap)[c] = U_Limbu;
for (char32 c = 0x1929; c <= 0x192B; ++c) (*smap)[c] = U_Limbu;
for (char32 c = 0x1930; c <= 0x1931; ++c) (*smap)[c] = U_Limbu;
(*smap)[0x1932] = U_Limbu;
for (char32 c = 0x1933; c <= 0x1938; ++c) (*smap)[c] = U_Limbu;
for (char32 c = 0x1939; c <= 0x193B; ++c) (*smap)[c] = U_Limbu;
(*smap)[0x1940] = U_Limbu;
for (char32 c = 0x1944; c <= 0x1945; ++c) (*smap)[c] = U_Limbu;
for (char32 c = 0x1946; c <= 0x194F; ++c) (*smap)[c] = U_Limbu;
for (char32 c = 0x1950; c <= 0x196D; ++c) (*smap)[c] = U_Tai_Le;
for (char32 c = 0x1970; c <= 0x1974; ++c) (*smap)[c] = U_Tai_Le;
for (char32 c = 0x10000; c <= 0x1000B; ++c) (*smap)[c] = U_Linear_B;
for (char32 c = 0x1000D; c <= 0x10026; ++c) (*smap)[c] = U_Linear_B;
for (char32 c = 0x10028; c <= 0x1003A; ++c) (*smap)[c] = U_Linear_B;
for (char32 c = 0x1003C; c <= 0x1003D; ++c) (*smap)[c] = U_Linear_B;
for (char32 c = 0x1003F; c <= 0x1004D; ++c) (*smap)[c] = U_Linear_B;
for (char32 c = 0x10050; c <= 0x1005D; ++c) (*smap)[c] = U_Linear_B;
for (char32 c = 0x10080; c <= 0x100FA; ++c) (*smap)[c] = U_Linear_B;
for (char32 c = 0x10380; c <= 0x1039D; ++c) (*smap)[c] = U_Ugaritic;
(*smap)[0x1039F] = U_Ugaritic;
for (char32 c = 0x10450; c <= 0x1047F; ++c) (*smap)[c] = U_Shavian;
for (char32 c = 0x10480; c <= 0x1049D; ++c) (*smap)[c] = U_Osmanya;
for (char32 c = 0x104A0; c <= 0x104A9; ++c) (*smap)[c] = U_Osmanya;
for (char32 c = 0x10800; c <= 0x10805; ++c) (*smap)[c] = U_Cypriot;
(*smap)[0x10808] = U_Cypriot;
for (char32 c = 0x1080A; c <= 0x10835; ++c) (*smap)[c] = U_Cypriot;
for (char32 c = 0x10837; c <= 0x10838; ++c) (*smap)[c] = U_Cypriot;
(*smap)[0x1083C] = U_Cypriot;
(*smap)[0x1083F] = U_Cypriot;
for (char32 c = 0x2800; c <= 0x28FF; ++c) (*smap)[c] = U_Braille;
for (char32 c = 0x1A00; c <= 0x1A16; ++c) (*smap)[c] = U_Buginese;
for (char32 c = 0x1A17; c <= 0x1A18; ++c) (*smap)[c] = U_Buginese;
for (char32 c = 0x1A19; c <= 0x1A1A; ++c) (*smap)[c] = U_Buginese;
(*smap)[0x1A1B] = U_Buginese;
for (char32 c = 0x1A1E; c <= 0x1A1F; ++c) (*smap)[c] = U_Buginese;
for (char32 c = 0x03E2; c <= 0x03EF; ++c) (*smap)[c] = U_Coptic;
for (char32 c = 0x2C80; c <= 0x2CE4; ++c) (*smap)[c] = U_Coptic;
for (char32 c = 0x2CE5; c <= 0x2CEA; ++c) (*smap)[c] = U_Coptic;
for (char32 c = 0x2CEB; c <= 0x2CEE; ++c) (*smap)[c] = U_Coptic;
for (char32 c = 0x2CEF; c <= 0x2CF1; ++c) (*smap)[c] = U_Coptic;
for (char32 c = 0x2CF2; c <= 0x2CF3; ++c) (*smap)[c] = U_Coptic;
for (char32 c = 0x2CF9; c <= 0x2CFC; ++c) (*smap)[c] = U_Coptic;
(*smap)[0x2CFD] = U_Coptic;
for (char32 c = 0x2CFE; c <= 0x2CFF; ++c) (*smap)[c] = U_Coptic;
for (char32 c = 0x1980; c <= 0x19AB; ++c) (*smap)[c] = U_New_Tai_Lue;
for (char32 c = 0x19B0; c <= 0x19C9; ++c) (*smap)[c] = U_New_Tai_Lue;
for (char32 c = 0x19D0; c <= 0x19D9; ++c) (*smap)[c] = U_New_Tai_Lue;
(*smap)[0x19DA] = U_New_Tai_Lue;
for (char32 c = 0x19DE; c <= 0x19DF; ++c) (*smap)[c] = U_New_Tai_Lue;
for (char32 c = 0x2C00; c <= 0x2C2E; ++c) (*smap)[c] = U_Glagolitic;
for (char32 c = 0x2C30; c <= 0x2C5E; ++c) (*smap)[c] = U_Glagolitic;
for (char32 c = 0x1E000; c <= 0x1E006; ++c) (*smap)[c] = U_Glagolitic;
for (char32 c = 0x1E008; c <= 0x1E018; ++c) (*smap)[c] = U_Glagolitic;
for (char32 c = 0x1E01B; c <= 0x1E021; ++c) (*smap)[c] = U_Glagolitic;
for (char32 c = 0x1E023; c <= 0x1E024; ++c) (*smap)[c] = U_Glagolitic;
for (char32 c = 0x1E026; c <= 0x1E02A; ++c) (*smap)[c] = U_Glagolitic;
for (char32 c = 0x2D30; c <= 0x2D67; ++c) (*smap)[c] = U_Tifinagh;
(*smap)[0x2D6F] = U_Tifinagh;
(*smap)[0x2D70] = U_Tifinagh;
(*smap)[0x2D7F] = U_Tifinagh;
for (char32 c = 0xA800; c <= 0xA801; ++c) (*smap)[c] = U_Syloti_Nagri;
(*smap)[0xA802] = U_Syloti_Nagri;
for (char32 c = 0xA803; c <= 0xA805; ++c) (*smap)[c] = U_Syloti_Nagri;
(*smap)[0xA806] = U_Syloti_Nagri;
for (char32 c = 0xA807; c <= 0xA80A; ++c) (*smap)[c] = U_Syloti_Nagri;
(*smap)[0xA80B] = U_Syloti_Nagri;
for (char32 c = 0xA80C; c <= 0xA822; ++c) (*smap)[c] = U_Syloti_Nagri;
for (char32 c = 0xA823; c <= 0xA824; ++c) (*smap)[c] = U_Syloti_Nagri;
for (char32 c = 0xA825; c <= 0xA826; ++c) (*smap)[c] = U_Syloti_Nagri;
(*smap)[0xA827] = U_Syloti_Nagri;
for (char32 c = 0xA828; c <= 0xA82B; ++c) (*smap)[c] = U_Syloti_Nagri;
for (char32 c = 0x103A0; c <= 0x103C3; ++c) (*smap)[c] = U_Old_Persian;
for (char32 c = 0x103C8; c <= 0x103CF; ++c) (*smap)[c] = U_Old_Persian;
(*smap)[0x103D0] = U_Old_Persian;
for (char32 c = 0x103D1; c <= 0x103D5; ++c) (*smap)[c] = U_Old_Persian;
(*smap)[0x10A00] = U_Kharoshthi;
for (char32 c = 0x10A01; c <= 0x10A03; ++c) (*smap)[c] = U_Kharoshthi;
for (char32 c = 0x10A05; c <= 0x10A06; ++c) (*smap)[c] = U_Kharoshthi;
for (char32 c = 0x10A0C; c <= 0x10A0F; ++c) (*smap)[c] = U_Kharoshthi;
for (char32 c = 0x10A10; c <= 0x10A13; ++c) (*smap)[c] = U_Kharoshthi;
for (char32 c = 0x10A15; c <= 0x10A17; ++c) (*smap)[c] = U_Kharoshthi;
for (char32 c = 0x10A19; c <= 0x10A33; ++c) (*smap)[c] = U_Kharoshthi;
for (char32 c = 0x10A38; c <= 0x10A3A; ++c) (*smap)[c] = U_Kharoshthi;
(*smap)[0x10A3F] = U_Kharoshthi;
for (char32 c = 0x10A40; c <= 0x10A47; ++c) (*smap)[c] = U_Kharoshthi;
for (char32 c = 0x10A50; c <= 0x10A58; ++c) (*smap)[c] = U_Kharoshthi;
for (char32 c = 0x1B00; c <= 0x1B03; ++c) (*smap)[c] = U_Balinese;
(*smap)[0x1B04] = U_Balinese;
for (char32 c = 0x1B05; c <= 0x1B33; ++c) (*smap)[c] = U_Balinese;
(*smap)[0x1B34] = U_Balinese;
(*smap)[0x1B35] = U_Balinese;
for (char32 c = 0x1B36; c <= 0x1B3A; ++c) (*smap)[c] = U_Balinese;
(*smap)[0x1B3B] = U_Balinese;
(*smap)[0x1B3C] = U_Balinese;
for (char32 c = 0x1B3D; c <= 0x1B41; ++c) (*smap)[c] = U_Balinese;
(*smap)[0x1B42] = U_Balinese;
for (char32 c = 0x1B43; c <= 0x1B44; ++c) (*smap)[c] = U_Balinese;
for (char32 c = 0x1B45; c <= 0x1B4B; ++c) (*smap)[c] = U_Balinese;
for (char32 c = 0x1B50; c <= 0x1B59; ++c) (*smap)[c] = U_Balinese;
for (char32 c = 0x1B5A; c <= 0x1B60; ++c) (*smap)[c] = U_Balinese;
for (char32 c = 0x1B61; c <= 0x1B6A; ++c) (*smap)[c] = U_Balinese;
for (char32 c = 0x1B6B; c <= 0x1B73; ++c) (*smap)[c] = U_Balinese;
for (char32 c = 0x1B74; c <= 0x1B7C; ++c) (*smap)[c] = U_Balinese;
for (char32 c = 0x12000; c <= 0x12399; ++c) (*smap)[c] = U_Cuneiform;
for (char32 c = 0x12400; c <= 0x1246E; ++c) (*smap)[c] = U_Cuneiform;
for (char32 c = 0x12470; c <= 0x12474; ++c) (*smap)[c] = U_Cuneiform;
for (char32 c = 0x12480; c <= 0x12543; ++c) (*smap)[c] = U_Cuneiform;
for (char32 c = 0x10900; c <= 0x10915; ++c) (*smap)[c] = U_Phoenician;
for (char32 c = 0x10916; c <= 0x1091B; ++c) (*smap)[c] = U_Phoenician;
(*smap)[0x1091F] = U_Phoenician;
for (char32 c = 0xA840; c <= 0xA873; ++c) (*smap)[c] = U_Phags_Pa;
for (char32 c = 0xA874; c <= 0xA877; ++c) (*smap)[c] = U_Phags_Pa;
for (char32 c = 0x07C0; c <= 0x07C9; ++c) (*smap)[c] = U_Nko;
for (char32 c = 0x07CA; c <= 0x07EA; ++c) (*smap)[c] = U_Nko;
for (char32 c = 0x07EB; c <= 0x07F3; ++c) (*smap)[c] = U_Nko;
for (char32 c = 0x07F4; c <= 0x07F5; ++c) (*smap)[c] = U_Nko;
(*smap)[0x07F6] = U_Nko;
for (char32 c = 0x07F7; c <= 0x07F9; ++c) (*smap)[c] = U_Nko;
(*smap)[0x07FA] = U_Nko;
for (char32 c = 0x1B80; c <= 0x1B81; ++c) (*smap)[c] = U_Sundanese;
(*smap)[0x1B82] = U_Sundanese;
for (char32 c = 0x1B83; c <= 0x1BA0; ++c) (*smap)[c] = U_Sundanese;
(*smap)[0x1BA1] = U_Sundanese;
for (char32 c = 0x1BA2; c <= 0x1BA5; ++c) (*smap)[c] = U_Sundanese;
for (char32 c = 0x1BA6; c <= 0x1BA7; ++c) (*smap)[c] = U_Sundanese;
for (char32 c = 0x1BA8; c <= 0x1BA9; ++c) (*smap)[c] = U_Sundanese;
(*smap)[0x1BAA] = U_Sundanese;
for (char32 c = 0x1BAB; c <= 0x1BAD; ++c) (*smap)[c] = U_Sundanese;
for (char32 c = 0x1BAE; c <= 0x1BAF; ++c) (*smap)[c] = U_Sundanese;
for (char32 c = 0x1BB0; c <= 0x1BB9; ++c) (*smap)[c] = U_Sundanese;
for (char32 c = 0x1BBA; c <= 0x1BBF; ++c) (*smap)[c] = U_Sundanese;
for (char32 c = 0x1CC0; c <= 0x1CC7; ++c) (*smap)[c] = U_Sundanese;
for (char32 c = 0x1C00; c <= 0x1C23; ++c) (*smap)[c] = U_Lepcha;
for (char32 c = 0x1C24; c <= 0x1C2B; ++c) (*smap)[c] = U_Lepcha;
for (char32 c = 0x1C2C; c <= 0x1C33; ++c) (*smap)[c] = U_Lepcha;
for (char32 c = 0x1C34; c <= 0x1C35; ++c) (*smap)[c] = U_Lepcha;
for (char32 c = 0x1C36; c <= 0x1C37; ++c) (*smap)[c] = U_Lepcha;
for (char32 c = 0x1C3B; c <= 0x1C3F; ++c) (*smap)[c] = U_Lepcha;
for (char32 c = 0x1C40; c <= 0x1C49; ++c) (*smap)[c] = U_Lepcha;
for (char32 c = 0x1C4D; c <= 0x1C4F; ++c) (*smap)[c] = U_Lepcha;
for (char32 c = 0x1C50; c <= 0x1C59; ++c) (*smap)[c] = U_Ol_Chiki;
for (char32 c = 0x1C5A; c <= 0x1C77; ++c) (*smap)[c] = U_Ol_Chiki;
for (char32 c = 0x1C78; c <= 0x1C7D; ++c) (*smap)[c] = U_Ol_Chiki;
for (char32 c = 0x1C7E; c <= 0x1C7F; ++c) (*smap)[c] = U_Ol_Chiki;
for (char32 c = 0xA500; c <= 0xA60B; ++c) (*smap)[c] = U_Vai;
(*smap)[0xA60C] = U_Vai;
for (char32 c = 0xA60D; c <= 0xA60F; ++c) (*smap)[c] = U_Vai;
for (char32 c = 0xA610; c <= 0xA61F; ++c) (*smap)[c] = U_Vai;
for (char32 c = 0xA620; c <= 0xA629; ++c) (*smap)[c] = U_Vai;
for (char32 c = 0xA62A; c <= 0xA62B; ++c) (*smap)[c] = U_Vai;
for (char32 c = 0xA880; c <= 0xA881; ++c) (*smap)[c] = U_Saurashtra;
for (char32 c = 0xA882; c <= 0xA8B3; ++c) (*smap)[c] = U_Saurashtra;
for (char32 c = 0xA8B4; c <= 0xA8C3; ++c) (*smap)[c] = U_Saurashtra;
for (char32 c = 0xA8C4; c <= 0xA8C5; ++c) (*smap)[c] = U_Saurashtra;
for (char32 c = 0xA8CE; c <= 0xA8CF; ++c) (*smap)[c] = U_Saurashtra;
for (char32 c = 0xA8D0; c <= 0xA8D9; ++c) (*smap)[c] = U_Saurashtra;
for (char32 c = 0xA900; c <= 0xA909; ++c) (*smap)[c] = U_Kayah_Li;
for (char32 c = 0xA90A; c <= 0xA925; ++c) (*smap)[c] = U_Kayah_Li;
for (char32 c = 0xA926; c <= 0xA92D; ++c) (*smap)[c] = U_Kayah_Li;
(*smap)[0xA92F] = U_Kayah_Li;
for (char32 c = 0xA930; c <= 0xA946; ++c) (*smap)[c] = U_Rejang;
for (char32 c = 0xA947; c <= 0xA951; ++c) (*smap)[c] = U_Rejang;
for (char32 c = 0xA952; c <= 0xA953; ++c) (*smap)[c] = U_Rejang;
(*smap)[0xA95F] = U_Rejang;
for (char32 c = 0x10280; c <= 0x1029C; ++c) (*smap)[c] = U_Lycian;
for (char32 c = 0x102A0; c <= 0x102D0; ++c) (*smap)[c] = U_Carian;
for (char32 c = 0x10920; c <= 0x10939; ++c) (*smap)[c] = U_Lydian;
(*smap)[0x1093F] = U_Lydian;
for (char32 c = 0xAA00; c <= 0xAA28; ++c) (*smap)[c] = U_Cham;
for (char32 c = 0xAA29; c <= 0xAA2E; ++c) (*smap)[c] = U_Cham;
for (char32 c = 0xAA2F; c <= 0xAA30; ++c) (*smap)[c] = U_Cham;
for (char32 c = 0xAA31; c <= 0xAA32; ++c) (*smap)[c] = U_Cham;
for (char32 c = 0xAA33; c <= 0xAA34; ++c) (*smap)[c] = U_Cham;
for (char32 c = 0xAA35; c <= 0xAA36; ++c) (*smap)[c] = U_Cham;
for (char32 c = 0xAA40; c <= 0xAA42; ++c) (*smap)[c] = U_Cham;
(*smap)[0xAA43] = U_Cham;
for (char32 c = 0xAA44; c <= 0xAA4B; ++c) (*smap)[c] = U_Cham;
(*smap)[0xAA4C] = U_Cham;
(*smap)[0xAA4D] = U_Cham;
for (char32 c = 0xAA50; c <= 0xAA59; ++c) (*smap)[c] = U_Cham;
for (char32 c = 0xAA5C; c <= 0xAA5F; ++c) (*smap)[c] = U_Cham;
for (char32 c = 0x1A20; c <= 0x1A54; ++c) (*smap)[c] = U_Tai_Tham;
(*smap)[0x1A55] = U_Tai_Tham;
(*smap)[0x1A56] = U_Tai_Tham;
(*smap)[0x1A57] = U_Tai_Tham;
for (char32 c = 0x1A58; c <= 0x1A5E; ++c) (*smap)[c] = U_Tai_Tham;
(*smap)[0x1A60] = U_Tai_Tham;
(*smap)[0x1A61] = U_Tai_Tham;
(*smap)[0x1A62] = U_Tai_Tham;
for (char32 c = 0x1A63; c <= 0x1A64; ++c) (*smap)[c] = U_Tai_Tham;
for (char32 c = 0x1A65; c <= 0x1A6C; ++c) (*smap)[c] = U_Tai_Tham;
for (char32 c = 0x1A6D; c <= 0x1A72; ++c) (*smap)[c] = U_Tai_Tham;
for (char32 c = 0x1A73; c <= 0x1A7C; ++c) (*smap)[c] = U_Tai_Tham;
(*smap)[0x1A7F] = U_Tai_Tham;
for (char32 c = 0x1A80; c <= 0x1A89; ++c) (*smap)[c] = U_Tai_Tham;
for (char32 c = 0x1A90; c <= 0x1A99; ++c) (*smap)[c] = U_Tai_Tham;
for (char32 c = 0x1AA0; c <= 0x1AA6; ++c) (*smap)[c] = U_Tai_Tham;
(*smap)[0x1AA7] = U_Tai_Tham;
for (char32 c = 0x1AA8; c <= 0x1AAD; ++c) (*smap)[c] = U_Tai_Tham;
for (char32 c = 0xAA80; c <= 0xAAAF; ++c) (*smap)[c] = U_Tai_Viet;
(*smap)[0xAAB0] = U_Tai_Viet;
(*smap)[0xAAB1] = U_Tai_Viet;
for (char32 c = 0xAAB2; c <= 0xAAB4; ++c) (*smap)[c] = U_Tai_Viet;
for (char32 c = 0xAAB5; c <= 0xAAB6; ++c) (*smap)[c] = U_Tai_Viet;
for (char32 c = 0xAAB7; c <= 0xAAB8; ++c) (*smap)[c] = U_Tai_Viet;
for (char32 c = 0xAAB9; c <= 0xAABD; ++c) (*smap)[c] = U_Tai_Viet;
for (char32 c = 0xAABE; c <= 0xAABF; ++c) (*smap)[c] = U_Tai_Viet;
(*smap)[0xAAC0] = U_Tai_Viet;
(*smap)[0xAAC1] = U_Tai_Viet;
(*smap)[0xAAC2] = U_Tai_Viet;
for (char32 c = 0xAADB; c <= 0xAADC; ++c) (*smap)[c] = U_Tai_Viet;
(*smap)[0xAADD] = U_Tai_Viet;
for (char32 c = 0xAADE; c <= 0xAADF; ++c) (*smap)[c] = U_Tai_Viet;
for (char32 c = 0x10B00; c <= 0x10B35; ++c) (*smap)[c] = U_Avestan;
for (char32 c = 0x10B39; c <= 0x10B3F; ++c) (*smap)[c] = U_Avestan;
for (char32 c = 0x13000; c <= 0x1342E; ++c)
  (*smap)[c] = U_Egyptian_Hieroglyphs;
for (char32 c = 0x0800; c <= 0x0815; ++c) (*smap)[c] = U_Samaritan;
for (char32 c = 0x0816; c <= 0x0819; ++c) (*smap)[c] = U_Samaritan;
(*smap)[0x081A] = U_Samaritan;
for (char32 c = 0x081B; c <= 0x0823; ++c) (*smap)[c] = U_Samaritan;
(*smap)[0x0824] = U_Samaritan;
for (char32 c = 0x0825; c <= 0x0827; ++c) (*smap)[c] = U_Samaritan;
(*smap)[0x0828] = U_Samaritan;
for (char32 c = 0x0829; c <= 0x082D; ++c) (*smap)[c] = U_Samaritan;
for (char32 c = 0x0830; c <= 0x083E; ++c) (*smap)[c] = U_Samaritan;
for (char32 c = 0xA4D0; c <= 0xA4F7; ++c) (*smap)[c] = U_Lisu;
for (char32 c = 0xA4F8; c <= 0xA4FD; ++c) (*smap)[c] = U_Lisu;
for (char32 c = 0xA4FE; c <= 0xA4FF; ++c) (*smap)[c] = U_Lisu;
for (char32 c = 0xA6A0; c <= 0xA6E5; ++c) (*smap)[c] = U_Bamum;
for (char32 c = 0xA6E6; c <= 0xA6EF; ++c) (*smap)[c] = U_Bamum;
for (char32 c = 0xA6F0; c <= 0xA6F1; ++c) (*smap)[c] = U_Bamum;
for (char32 c = 0xA6F2; c <= 0xA6F7; ++c) (*smap)[c] = U_Bamum;
for (char32 c = 0x16800; c <= 0x16A38; ++c) (*smap)[c] = U_Bamum;
for (char32 c = 0xA980; c <= 0xA982; ++c) (*smap)[c] = U_Javanese;
(*smap)[0xA983] = U_Javanese;
for (char32 c = 0xA984; c <= 0xA9B2; ++c) (*smap)[c] = U_Javanese;
(*smap)[0xA9B3] = U_Javanese;
for (char32 c = 0xA9B4; c <= 0xA9B5; ++c) (*smap)[c] = U_Javanese;
for (char32 c = 0xA9B6; c <= 0xA9B9; ++c) (*smap)[c] = U_Javanese;
for (char32 c = 0xA9BA; c <= 0xA9BB; ++c) (*smap)[c] = U_Javanese;
(*smap)[0xA9BC] = U_Javanese;
for (char32 c = 0xA9BD; c <= 0xA9C0; ++c) (*smap)[c] = U_Javanese;
for (char32 c = 0xA9C1; c <= 0xA9CD; ++c) (*smap)[c] = U_Javanese;
for (char32 c = 0xA9D0; c <= 0xA9D9; ++c) (*smap)[c] = U_Javanese;
for (char32 c = 0xA9DE; c <= 0xA9DF; ++c) (*smap)[c] = U_Javanese;
for (char32 c = 0xAAE0; c <= 0xAAEA; ++c) (*smap)[c] = U_Meetei_Mayek;
(*smap)[0xAAEB] = U_Meetei_Mayek;
for (char32 c = 0xAAEC; c <= 0xAAED; ++c) (*smap)[c] = U_Meetei_Mayek;
for (char32 c = 0xAAEE; c <= 0xAAEF; ++c) (*smap)[c] = U_Meetei_Mayek;
for (char32 c = 0xAAF0; c <= 0xAAF1; ++c) (*smap)[c] = U_Meetei_Mayek;
(*smap)[0xAAF2] = U_Meetei_Mayek;
for (char32 c = 0xAAF3; c <= 0xAAF4; ++c) (*smap)[c] = U_Meetei_Mayek;
(*smap)[0xAAF5] = U_Meetei_Mayek;
(*smap)[0xAAF6] = U_Meetei_Mayek;
for (char32 c = 0xABC0; c <= 0xABE2; ++c) (*smap)[c] = U_Meetei_Mayek;
for (char32 c = 0xABE3; c <= 0xABE4; ++c) (*smap)[c] = U_Meetei_Mayek;
(*smap)[0xABE5] = U_Meetei_Mayek;
for (char32 c = 0xABE6; c <= 0xABE7; ++c) (*smap)[c] = U_Meetei_Mayek;
(*smap)[0xABE8] = U_Meetei_Mayek;
for (char32 c = 0xABE9; c <= 0xABEA; ++c) (*smap)[c] = U_Meetei_Mayek;
(*smap)[0xABEB] = U_Meetei_Mayek;
(*smap)[0xABEC] = U_Meetei_Mayek;
(*smap)[0xABED] = U_Meetei_Mayek;
for (char32 c = 0xABF0; c <= 0xABF9; ++c) (*smap)[c] = U_Meetei_Mayek;
for (char32 c = 0x10840; c <= 0x10855; ++c) (*smap)[c] = U_Imperial_Aramaic;
(*smap)[0x10857] = U_Imperial_Aramaic;
for (char32 c = 0x10858; c <= 0x1085F; ++c) (*smap)[c] = U_Imperial_Aramaic;
for (char32 c = 0x10A60; c <= 0x10A7C; ++c) (*smap)[c] = U_Old_South_Arabian;
for (char32 c = 0x10A7D; c <= 0x10A7E; ++c) (*smap)[c] = U_Old_South_Arabian;
(*smap)[0x10A7F] = U_Old_South_Arabian;
for (char32 c = 0x10B40; c <= 0x10B55; ++c)
  (*smap)[c] = U_Inscriptional_Parthian;
for (char32 c = 0x10B58; c <= 0x10B5F; ++c)
  (*smap)[c] = U_Inscriptional_Parthian;
for (char32 c = 0x10B60; c <= 0x10B72; ++c)
  (*smap)[c] = U_Inscriptional_Pahlavi;
for (char32 c = 0x10B78; c <= 0x10B7F; ++c)
  (*smap)[c] = U_Inscriptional_Pahlavi;
for (char32 c = 0x10C00; c <= 0x10C48; ++c) (*smap)[c] = U_Old_Turkic;
for (char32 c = 0x11080; c <= 0x11081; ++c) (*smap)[c] = U_Kaithi;
(*smap)[0x11082] = U_Kaithi;
for (char32 c = 0x11083; c <= 0x110AF; ++c) (*smap)[c] = U_Kaithi;
for (char32 c = 0x110B0; c <= 0x110B2; ++c) (*smap)[c] = U_Kaithi;
for (char32 c = 0x110B3; c <= 0x110B6; ++c) (*smap)[c] = U_Kaithi;
for (char32 c = 0x110B7; c <= 0x110B8; ++c) (*smap)[c] = U_Kaithi;
for (char32 c = 0x110B9; c <= 0x110BA; ++c) (*smap)[c] = U_Kaithi;
for (char32 c = 0x110BB; c <= 0x110BC; ++c) (*smap)[c] = U_Kaithi;
(*smap)[0x110BD] = U_Kaithi;
for (char32 c = 0x110BE; c <= 0x110C1; ++c) (*smap)[c] = U_Kaithi;
for (char32 c = 0x1BC0; c <= 0x1BE5; ++c) (*smap)[c] = U_Batak;
(*smap)[0x1BE6] = U_Batak;
(*smap)[0x1BE7] = U_Batak;
for (char32 c = 0x1BE8; c <= 0x1BE9; ++c) (*smap)[c] = U_Batak;
for (char32 c = 0x1BEA; c <= 0x1BEC; ++c) (*smap)[c] = U_Batak;
(*smap)[0x1BED] = U_Batak;
(*smap)[0x1BEE] = U_Batak;
for (char32 c = 0x1BEF; c <= 0x1BF1; ++c) (*smap)[c] = U_Batak;
for (char32 c = 0x1BF2; c <= 0x1BF3; ++c) (*smap)[c] = U_Batak;
for (char32 c = 0x1BFC; c <= 0x1BFF; ++c) (*smap)[c] = U_Batak;
(*smap)[0x11000] = U_Brahmi;
(*smap)[0x11001] = U_Brahmi;
(*smap)[0x11002] = U_Brahmi;
for (char32 c = 0x11003; c <= 0x11037; ++c) (*smap)[c] = U_Brahmi;
for (char32 c = 0x11038; c <= 0x11046; ++c) (*smap)[c] = U_Brahmi;
for (char32 c = 0x11047; c <= 0x1104D; ++c) (*smap)[c] = U_Brahmi;
for (char32 c = 0x11052; c <= 0x11065; ++c) (*smap)[c] = U_Brahmi;
for (char32 c = 0x11066; c <= 0x1106F; ++c) (*smap)[c] = U_Brahmi;
(*smap)[0x1107F] = U_Brahmi;
for (char32 c = 0x0840; c <= 0x0858; ++c) (*smap)[c] = U_Mandaic;
for (char32 c = 0x0859; c <= 0x085B; ++c) (*smap)[c] = U_Mandaic;
(*smap)[0x085E] = U_Mandaic;
for (char32 c = 0x11100; c <= 0x11102; ++c) (*smap)[c] = U_Chakma;
for (char32 c = 0x11103; c <= 0x11126; ++c) (*smap)[c] = U_Chakma;
for (char32 c = 0x11127; c <= 0x1112B; ++c) (*smap)[c] = U_Chakma;
(*smap)[0x1112C] = U_Chakma;
for (char32 c = 0x1112D; c <= 0x11134; ++c) (*smap)[c] = U_Chakma;
for (char32 c = 0x11136; c <= 0x1113F; ++c) (*smap)[c] = U_Chakma;
for (char32 c = 0x11140; c <= 0x11143; ++c) (*smap)[c] = U_Chakma;
for (char32 c = 0x109A0; c <= 0x109B7; ++c) (*smap)[c] = U_Meroitic_Cursive;
for (char32 c = 0x109BC; c <= 0x109BD; ++c) (*smap)[c] = U_Meroitic_Cursive;
for (char32 c = 0x109BE; c <= 0x109BF; ++c) (*smap)[c] = U_Meroitic_Cursive;
for (char32 c = 0x109C0; c <= 0x109CF; ++c) (*smap)[c] = U_Meroitic_Cursive;
for (char32 c = 0x109D2; c <= 0x109FF; ++c) (*smap)[c] = U_Meroitic_Cursive;
for (char32 c = 0x10980; c <= 0x1099F; ++c)
  (*smap)[c] = U_Meroitic_Hieroglyphs;
for (char32 c = 0x16F00; c <= 0x16F44; ++c) (*smap)[c] = U_Miao;
(*smap)[0x16F50] = U_Miao;
for (char32 c = 0x16F51; c <= 0x16F7E; ++c) (*smap)[c] = U_Miao;
for (char32 c = 0x16F8F; c <= 0x16F92; ++c) (*smap)[c] = U_Miao;
for (char32 c = 0x16F93; c <= 0x16F9F; ++c) (*smap)[c] = U_Miao;
for (char32 c = 0x11180; c <= 0x11181; ++c) (*smap)[c] = U_Sharada;
(*smap)[0x11182] = U_Sharada;
for (char32 c = 0x11183; c <= 0x111B2; ++c) (*smap)[c] = U_Sharada;
for (char32 c = 0x111B3; c <= 0x111B5; ++c) (*smap)[c] = U_Sharada;
for (char32 c = 0x111B6; c <= 0x111BE; ++c) (*smap)[c] = U_Sharada;
for (char32 c = 0x111BF; c <= 0x111C0; ++c) (*smap)[c] = U_Sharada;
for (char32 c = 0x111C1; c <= 0x111C4; ++c) (*smap)[c] = U_Sharada;
for (char32 c = 0x111C5; c <= 0x111C9; ++c) (*smap)[c] = U_Sharada;
for (char32 c = 0x111CA; c <= 0x111CC; ++c) (*smap)[c] = U_Sharada;
(*smap)[0x111CD] = U_Sharada;
for (char32 c = 0x111D0; c <= 0x111D9; ++c) (*smap)[c] = U_Sharada;
(*smap)[0x111DA] = U_Sharada;
(*smap)[0x111DB] = U_Sharada;
(*smap)[0x111DC] = U_Sharada;
for (char32 c = 0x111DD; c <= 0x111DF; ++c) (*smap)[c] = U_Sharada;
for (char32 c = 0x110D0; c <= 0x110E8; ++c) (*smap)[c] = U_Sora_Sompeng;
for (char32 c = 0x110F0; c <= 0x110F9; ++c) (*smap)[c] = U_Sora_Sompeng;
for (char32 c = 0x11680; c <= 0x116AA; ++c) (*smap)[c] = U_Takri;
(*smap)[0x116AB] = U_Takri;
(*smap)[0x116AC] = U_Takri;
(*smap)[0x116AD] = U_Takri;
for (char32 c = 0x116AE; c <= 0x116AF; ++c) (*smap)[c] = U_Takri;
for (char32 c = 0x116B0; c <= 0x116B5; ++c) (*smap)[c] = U_Takri;
(*smap)[0x116B6] = U_Takri;
(*smap)[0x116B7] = U_Takri;
for (char32 c = 0x116C0; c <= 0x116C9; ++c) (*smap)[c] = U_Takri;
for (char32 c = 0x10530; c <= 0x10563; ++c) (*smap)[c] = U_Caucasian_Albanian;
(*smap)[0x1056F] = U_Caucasian_Albanian;
for (char32 c = 0x16AD0; c <= 0x16AED; ++c) (*smap)[c] = U_Bassa_Vah;
for (char32 c = 0x16AF0; c <= 0x16AF4; ++c) (*smap)[c] = U_Bassa_Vah;
(*smap)[0x16AF5] = U_Bassa_Vah;
for (char32 c = 0x1BC00; c <= 0x1BC6A; ++c) (*smap)[c] = U_Duployan;
for (char32 c = 0x1BC70; c <= 0x1BC7C; ++c) (*smap)[c] = U_Duployan;
for (char32 c = 0x1BC80; c <= 0x1BC88; ++c) (*smap)[c] = U_Duployan;
for (char32 c = 0x1BC90; c <= 0x1BC99; ++c) (*smap)[c] = U_Duployan;
(*smap)[0x1BC9C] = U_Duployan;
for (char32 c = 0x1BC9D; c <= 0x1BC9E; ++c) (*smap)[c] = U_Duployan;
(*smap)[0x1BC9F] = U_Duployan;
for (char32 c = 0x10500; c <= 0x10527; ++c) (*smap)[c] = U_Elbasan;
for (char32 c = 0x11300; c <= 0x11301; ++c) (*smap)[c] = U_Grantha;
for (char32 c = 0x11302; c <= 0x11303; ++c) (*smap)[c] = U_Grantha;
for (char32 c = 0x11305; c <= 0x1130C; ++c) (*smap)[c] = U_Grantha;
for (char32 c = 0x1130F; c <= 0x11310; ++c) (*smap)[c] = U_Grantha;
for (char32 c = 0x11313; c <= 0x11328; ++c) (*smap)[c] = U_Grantha;
for (char32 c = 0x1132A; c <= 0x11330; ++c) (*smap)[c] = U_Grantha;
for (char32 c = 0x11332; c <= 0x11333; ++c) (*smap)[c] = U_Grantha;
for (char32 c = 0x11335; c <= 0x11339; ++c) (*smap)[c] = U_Grantha;
(*smap)[0x1133C] = U_Grantha;
(*smap)[0x1133D] = U_Grantha;
for (char32 c = 0x1133E; c <= 0x1133F; ++c) (*smap)[c] = U_Grantha;
(*smap)[0x11340] = U_Grantha;
for (char32 c = 0x11341; c <= 0x11344; ++c) (*smap)[c] = U_Grantha;
for (char32 c = 0x11347; c <= 0x11348; ++c) (*smap)[c] = U_Grantha;
for (char32 c = 0x1134B; c <= 0x1134D; ++c) (*smap)[c] = U_Grantha;
(*smap)[0x11350] = U_Grantha;
(*smap)[0x11357] = U_Grantha;
for (char32 c = 0x1135D; c <= 0x11361; ++c) (*smap)[c] = U_Grantha;
for (char32 c = 0x11362; c <= 0x11363; ++c) (*smap)[c] = U_Grantha;
for (char32 c = 0x11366; c <= 0x1136C; ++c) (*smap)[c] = U_Grantha;
for (char32 c = 0x11370; c <= 0x11374; ++c) (*smap)[c] = U_Grantha;
for (char32 c = 0x16B00; c <= 0x16B2F; ++c) (*smap)[c] = U_Pahawh_Hmong;
for (char32 c = 0x16B30; c <= 0x16B36; ++c) (*smap)[c] = U_Pahawh_Hmong;
for (char32 c = 0x16B37; c <= 0x16B3B; ++c) (*smap)[c] = U_Pahawh_Hmong;
for (char32 c = 0x16B3C; c <= 0x16B3F; ++c) (*smap)[c] = U_Pahawh_Hmong;
for (char32 c = 0x16B40; c <= 0x16B43; ++c) (*smap)[c] = U_Pahawh_Hmong;
(*smap)[0x16B44] = U_Pahawh_Hmong;
(*smap)[0x16B45] = U_Pahawh_Hmong;
for (char32 c = 0x16B50; c <= 0x16B59; ++c) (*smap)[c] = U_Pahawh_Hmong;
for (char32 c = 0x16B5B; c <= 0x16B61; ++c) (*smap)[c] = U_Pahawh_Hmong;
for (char32 c = 0x16B63; c <= 0x16B77; ++c) (*smap)[c] = U_Pahawh_Hmong;
for (char32 c = 0x16B7D; c <= 0x16B8F; ++c) (*smap)[c] = U_Pahawh_Hmong;
for (char32 c = 0x11200; c <= 0x11211; ++c) (*smap)[c] = U_Khojki;
for (char32 c = 0x11213; c <= 0x1122B; ++c) (*smap)[c] = U_Khojki;
for (char32 c = 0x1122C; c <= 0x1122E; ++c) (*smap)[c] = U_Khojki;
for (char32 c = 0x1122F; c <= 0x11231; ++c) (*smap)[c] = U_Khojki;
for (char32 c = 0x11232; c <= 0x11233; ++c) (*smap)[c] = U_Khojki;
(*smap)[0x11234] = U_Khojki;
(*smap)[0x11235] = U_Khojki;
for (char32 c = 0x11236; c <= 0x11237; ++c) (*smap)[c] = U_Khojki;
for (char32 c = 0x11238; c <= 0x1123D; ++c) (*smap)[c] = U_Khojki;
(*smap)[0x1123E] = U_Khojki;
for (char32 c = 0x10600; c <= 0x10736; ++c) (*smap)[c] = U_Linear_A;
for (char32 c = 0x10740; c <= 0x10755; ++c) (*smap)[c] = U_Linear_A;
for (char32 c = 0x10760; c <= 0x10767; ++c) (*smap)[c] = U_Linear_A;
for (char32 c = 0x11150; c <= 0x11172; ++c) (*smap)[c] = U_Mahajani;
(*smap)[0x11173] = U_Mahajani;
for (char32 c = 0x11174; c <= 0x11175; ++c) (*smap)[c] = U_Mahajani;
(*smap)[0x11176] = U_Mahajani;
for (char32 c = 0x10AC0; c <= 0x10AC7; ++c) (*smap)[c] = U_Manichaean;
(*smap)[0x10AC8] = U_Manichaean;
for (char32 c = 0x10AC9; c <= 0x10AE4; ++c) (*smap)[c] = U_Manichaean;
for (char32 c = 0x10AE5; c <= 0x10AE6; ++c) (*smap)[c] = U_Manichaean;
for (char32 c = 0x10AEB; c <= 0x10AEF; ++c) (*smap)[c] = U_Manichaean;
for (char32 c = 0x10AF0; c <= 0x10AF6; ++c) (*smap)[c] = U_Manichaean;
for (char32 c = 0x1E800; c <= 0x1E8C4; ++c) (*smap)[c] = U_Mende_Kikakui;
for (char32 c = 0x1E8C7; c <= 0x1E8CF; ++c) (*smap)[c] = U_Mende_Kikakui;
for (char32 c = 0x1E8D0; c <= 0x1E8D6; ++c) (*smap)[c] = U_Mende_Kikakui;
for (char32 c = 0x11600; c <= 0x1162F; ++c) (*smap)[c] = U_Modi;
for (char32 c = 0x11630; c <= 0x11632; ++c) (*smap)[c] = U_Modi;
for (char32 c = 0x11633; c <= 0x1163A; ++c) (*smap)[c] = U_Modi;
for (char32 c = 0x1163B; c <= 0x1163C; ++c) (*smap)[c] = U_Modi;
(*smap)[0x1163D] = U_Modi;
(*smap)[0x1163E] = U_Modi;
for (char32 c = 0x1163F; c <= 0x11640; ++c) (*smap)[c] = U_Modi;
for (char32 c = 0x11641; c <= 0x11643; ++c) (*smap)[c] = U_Modi;
(*smap)[0x11644] = U_Modi;
for (char32 c = 0x11650; c <= 0x11659; ++c) (*smap)[c] = U_Modi;
for (char32 c = 0x16A40; c <= 0x16A5E; ++c) (*smap)[c] = U_Mro;
for (char32 c = 0x16A60; c <= 0x16A69; ++c) (*smap)[c] = U_Mro;
for (char32 c = 0x16A6E; c <= 0x16A6F; ++c) (*smap)[c] = U_Mro;
for (char32 c = 0x10A80; c <= 0x10A9C; ++c) (*smap)[c] = U_Old_North_Arabian;
for (char32 c = 0x10A9D; c <= 0x10A9F; ++c) (*smap)[c] = U_Old_North_Arabian;
for (char32 c = 0x10880; c <= 0x1089E; ++c) (*smap)[c] = U_Nabataean;
for (char32 c = 0x108A7; c <= 0x108AF; ++c) (*smap)[c] = U_Nabataean;
for (char32 c = 0x10860; c <= 0x10876; ++c) (*smap)[c] = U_Palmyrene;
for (char32 c = 0x10877; c <= 0x10878; ++c) (*smap)[c] = U_Palmyrene;
for (char32 c = 0x10879; c <= 0x1087F; ++c) (*smap)[c] = U_Palmyrene;
for (char32 c = 0x11AC0; c <= 0x11AF8; ++c) (*smap)[c] = U_Pau_Cin_Hau;
for (char32 c = 0x10350; c <= 0x10375; ++c) (*smap)[c] = U_Old_Permic;
for (char32 c = 0x10376; c <= 0x1037A; ++c) (*smap)[c] = U_Old_Permic;
for (char32 c = 0x10B80; c <= 0x10B91; ++c) (*smap)[c] = U_Psalter_Pahlavi;
for (char32 c = 0x10B99; c <= 0x10B9C; ++c) (*smap)[c] = U_Psalter_Pahlavi;
for (char32 c = 0x10BA9; c <= 0x10BAF; ++c) (*smap)[c] = U_Psalter_Pahlavi;
for (char32 c = 0x11580; c <= 0x115AE; ++c) (*smap)[c] = U_Siddham;
for (char32 c = 0x115AF; c <= 0x115B1; ++c) (*smap)[c] = U_Siddham;
for (char32 c = 0x115B2; c <= 0x115B5; ++c) (*smap)[c] = U_Siddham;
for (char32 c = 0x115B8; c <= 0x115BB; ++c) (*smap)[c] = U_Siddham;
for (char32 c = 0x115BC; c <= 0x115BD; ++c) (*smap)[c] = U_Siddham;
(*smap)[0x115BE] = U_Siddham;
for (char32 c = 0x115BF; c <= 0x115C0; ++c) (*smap)[c] = U_Siddham;
for (char32 c = 0x115C1; c <= 0x115D7; ++c) (*smap)[c] = U_Siddham;
for (char32 c = 0x115D8; c <= 0x115DB; ++c) (*smap)[c] = U_Siddham;
for (char32 c = 0x115DC; c <= 0x115DD; ++c) (*smap)[c] = U_Siddham;
for (char32 c = 0x112B0; c <= 0x112DE; ++c) (*smap)[c] = U_Khudawadi;
(*smap)[0x112DF] = U_Khudawadi;
for (char32 c = 0x112E0; c <= 0x112E2; ++c) (*smap)[c] = U_Khudawadi;
for (char32 c = 0x112E3; c <= 0x112EA; ++c) (*smap)[c] = U_Khudawadi;
for (char32 c = 0x112F0; c <= 0x112F9; ++c) (*smap)[c] = U_Khudawadi;
for (char32 c = 0x11480; c <= 0x114AF; ++c) (*smap)[c] = U_Tirhuta;
for (char32 c = 0x114B0; c <= 0x114B2; ++c) (*smap)[c] = U_Tirhuta;
for (char32 c = 0x114B3; c <= 0x114B8; ++c) (*smap)[c] = U_Tirhuta;
(*smap)[0x114B9] = U_Tirhuta;
(*smap)[0x114BA] = U_Tirhuta;
for (char32 c = 0x114BB; c <= 0x114BE; ++c) (*smap)[c] = U_Tirhuta;
for (char32 c = 0x114BF; c <= 0x114C0; ++c) (*smap)[c] = U_Tirhuta;
(*smap)[0x114C1] = U_Tirhuta;
for (char32 c = 0x114C2; c <= 0x114C3; ++c) (*smap)[c] = U_Tirhuta;
for (char32 c = 0x114C4; c <= 0x114C5; ++c) (*smap)[c] = U_Tirhuta;
(*smap)[0x114C6] = U_Tirhuta;
(*smap)[0x114C7] = U_Tirhuta;
for (char32 c = 0x114D0; c <= 0x114D9; ++c) (*smap)[c] = U_Tirhuta;
for (char32 c = 0x118A0; c <= 0x118DF; ++c) (*smap)[c] = U_Warang_Citi;
for (char32 c = 0x118E0; c <= 0x118E9; ++c) (*smap)[c] = U_Warang_Citi;
for (char32 c = 0x118EA; c <= 0x118F2; ++c) (*smap)[c] = U_Warang_Citi;
(*smap)[0x118FF] = U_Warang_Citi;
for (char32 c = 0x11700; c <= 0x11719; ++c) (*smap)[c] = U_Ahom;
for (char32 c = 0x1171D; c <= 0x1171F; ++c) (*smap)[c] = U_Ahom;
for (char32 c = 0x11720; c <= 0x11721; ++c) (*smap)[c] = U_Ahom;
for (char32 c = 0x11722; c <= 0x11725; ++c) (*smap)[c] = U_Ahom;
(*smap)[0x11726] = U_Ahom;
for (char32 c = 0x11727; c <= 0x1172B; ++c) (*smap)[c] = U_Ahom;
for (char32 c = 0x11730; c <= 0x11739; ++c) (*smap)[c] = U_Ahom;
for (char32 c = 0x1173A; c <= 0x1173B; ++c) (*smap)[c] = U_Ahom;
for (char32 c = 0x1173C; c <= 0x1173E; ++c) (*smap)[c] = U_Ahom;
(*smap)[0x1173F] = U_Ahom;
for (char32 c = 0x14400; c <= 0x14646; ++c)
(*smap)[c] = U_Anatolian_Hieroglyphs;
for (char32 c = 0x108E0; c <= 0x108F2; ++c) (*smap)[c] = U_Hatran;
for (char32 c = 0x108F4; c <= 0x108F5; ++c) (*smap)[c] = U_Hatran;
for (char32 c = 0x108FB; c <= 0x108FF; ++c) (*smap)[c] = U_Hatran;
for (char32 c = 0x11280; c <= 0x11286; ++c) (*smap)[c] = U_Multani;
(*smap)[0x11288] = U_Multani;
for (char32 c = 0x1128A; c <= 0x1128D; ++c) (*smap)[c] = U_Multani;
for (char32 c = 0x1128F; c <= 0x1129D; ++c) (*smap)[c] = U_Multani;
for (char32 c = 0x1129F; c <= 0x112A8; ++c) (*smap)[c] = U_Multani;
(*smap)[0x112A9] = U_Multani;
for (char32 c = 0x10C80; c <= 0x10CB2; ++c) (*smap)[c] = U_Old_Hungarian;
for (char32 c = 0x10CC0; c <= 0x10CF2; ++c) (*smap)[c] = U_Old_Hungarian;
for (char32 c = 0x10CFA; c <= 0x10CFF; ++c) (*smap)[c] = U_Old_Hungarian;
for (char32 c = 0x1D800; c <= 0x1D9FF; ++c) (*smap)[c] = U_SignWriting;
for (char32 c = 0x1DA00; c <= 0x1DA36; ++c) (*smap)[c] = U_SignWriting;
for (char32 c = 0x1DA37; c <= 0x1DA3A; ++c) (*smap)[c] = U_SignWriting;
for (char32 c = 0x1DA3B; c <= 0x1DA6C; ++c) (*smap)[c] = U_SignWriting;
for (char32 c = 0x1DA6D; c <= 0x1DA74; ++c) (*smap)[c] = U_SignWriting;
(*smap)[0x1DA75] = U_SignWriting;
for (char32 c = 0x1DA76; c <= 0x1DA83; ++c) (*smap)[c] = U_SignWriting;
(*smap)[0x1DA84] = U_SignWriting;
for (char32 c = 0x1DA85; c <= 0x1DA86; ++c) (*smap)[c] = U_SignWriting;
for (char32 c = 0x1DA87; c <= 0x1DA8B; ++c) (*smap)[c] = U_SignWriting;
for (char32 c = 0x1DA9B; c <= 0x1DA9F; ++c) (*smap)[c] = U_SignWriting;
for (char32 c = 0x1DAA1; c <= 0x1DAAF; ++c) (*smap)[c] = U_SignWriting;
for (char32 c = 0x1E900; c <= 0x1E943; ++c) (*smap)[c] = U_Adlam;
for (char32 c = 0x1E944; c <= 0x1E94A; ++c) (*smap)[c] = U_Adlam;
for (char32 c = 0x1E950; c <= 0x1E959; ++c) (*smap)[c] = U_Adlam;
for (char32 c = 0x1E95E; c <= 0x1E95F; ++c) (*smap)[c] = U_Adlam;
for (char32 c = 0x11C00; c <= 0x11C08; ++c) (*smap)[c] = U_Bhaiksuki;
for (char32 c = 0x11C0A; c <= 0x11C2E; ++c) (*smap)[c] = U_Bhaiksuki;
(*smap)[0x11C2F] = U_Bhaiksuki;
for (char32 c = 0x11C30; c <= 0x11C36; ++c) (*smap)[c] = U_Bhaiksuki;
for (char32 c = 0x11C38; c <= 0x11C3D; ++c) (*smap)[c] = U_Bhaiksuki;
(*smap)[0x11C3E] = U_Bhaiksuki;
(*smap)[0x11C3F] = U_Bhaiksuki;
(*smap)[0x11C40] = U_Bhaiksuki;
for (char32 c = 0x11C41; c <= 0x11C45; ++c) (*smap)[c] = U_Bhaiksuki;
for (char32 c = 0x11C50; c <= 0x11C59; ++c) (*smap)[c] = U_Bhaiksuki;
for (char32 c = 0x11C5A; c <= 0x11C6C; ++c) (*smap)[c] = U_Bhaiksuki;
for (char32 c = 0x11C70; c <= 0x11C71; ++c) (*smap)[c] = U_Marchen;
for (char32 c = 0x11C72; c <= 0x11C8F; ++c) (*smap)[c] = U_Marchen;
for (char32 c = 0x11C92; c <= 0x11CA7; ++c) (*smap)[c] = U_Marchen;
(*smap)[0x11CA9] = U_Marchen;
for (char32 c = 0x11CAA; c <= 0x11CB0; ++c) (*smap)[c] = U_Marchen;
(*smap)[0x11CB1] = U_Marchen;
for (char32 c = 0x11CB2; c <= 0x11CB3; ++c) (*smap)[c] = U_Marchen;
(*smap)[0x11CB4] = U_Marchen;
for (char32 c = 0x11CB5; c <= 0x11CB6; ++c) (*smap)[c] = U_Marchen;
for (char32 c = 0x11400; c <= 0x11434; ++c) (*smap)[c] = U_Newa;
for (char32 c = 0x11435; c <= 0x11437; ++c) (*smap)[c] = U_Newa;
for (char32 c = 0x11438; c <= 0x1143F; ++c) (*smap)[c] = U_Newa;
for (char32 c = 0x11440; c <= 0x11441; ++c) (*smap)[c] = U_Newa;
for (char32 c = 0x11442; c <= 0x11444; ++c) (*smap)[c] = U_Newa;
(*smap)[0x11445] = U_Newa;
(*smap)[0x11446] = U_Newa;
for (char32 c = 0x11447; c <= 0x1144A; ++c) (*smap)[c] = U_Newa;
for (char32 c = 0x1144B; c <= 0x1144F; ++c) (*smap)[c] = U_Newa;
for (char32 c = 0x11450; c <= 0x11459; ++c) (*smap)[c] = U_Newa;
(*smap)[0x1145B] = U_Newa;
(*smap)[0x1145D] = U_Newa;
for (char32 c = 0x104B0; c <= 0x104D3; ++c) (*smap)[c] = U_Osage;
for (char32 c = 0x104D8; c <= 0x104FB; ++c) (*smap)[c] = U_Osage;
(*smap)[0x16FE0] = U_Tangut;
for (char32 c = 0x17000; c <= 0x187EC; ++c) (*smap)[c] = U_Tangut;
for (char32 c = 0x18800; c <= 0x18AF2; ++c) (*smap)[c] = U_Tangut;
}
} // namespace
} // namespace unicode_script
} // namespace sentencepiece
#endif // UNICODE_SCRIPT_DATA_H_
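// Illustrative sketch (not part of the generated table): how a per-code-point
// map like the one filled in above could be queried. The reduced ScriptType
// enum, the char32 alias, and the helper names InitTable and GetScript are
// stand-ins for this example only. Falling back to U_Common for unmapped code
// points is an assumption, not something this file states.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>

// Hypothetical stand-ins for the real enum and the char32 typedef.
enum ScriptType { U_Common, U_Osage, U_Tangut };
using char32 = uint32_t;

// Fills the map the same way the generated table does:
// one entry per code point, ranges expanded by a loop.
void InitTable(std::unordered_map<char32, ScriptType> *smap) {
  for (char32 c = 0x104B0; c <= 0x104D3; ++c) (*smap)[c] = U_Osage;
  for (char32 c = 0x104D8; c <= 0x104FB; ++c) (*smap)[c] = U_Osage;
  (*smap)[0x16FE0] = U_Tangut;
}

// Returns the script of `c`, or U_Common when `c` has no entry
// (assumed fallback for this sketch).
ScriptType GetScript(const std::unordered_map<char32, ScriptType> &smap,
                     char32 c) {
  const auto it = smap.find(c);
  return it == smap.end() ? U_Common : it->second;
}
```

// Code points that fall between two ranges, such as 0x104D4..0x104D7 in the
// Osage block, have no entry and therefore take the fallback value.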
sentencepiece-0.1.96/src/normalizer.h
// Copyright 2016 Google Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
#ifndef NORMALIZER_NORMALIZER_H_
#define NORMALIZER_NORMALIZER_H_
#include <memory>
#include <set>
#include <string>
#include <utility>
#include <vector>
#include "common.h"
#include "sentencepiece_model.pb.h"
#include "sentencepiece_processor.h"
#include "third_party/absl/strings/string_view.h"
#include "third_party/darts_clone/darts.h"
#include "util.h"
namespace sentencepiece {
namespace normalizer {
// Given a list of strings, finds the longest string which is a
// prefix of a query.
class PrefixMatcher {
public:
// Initializes the PrefixMatcher with `dic`.
explicit PrefixMatcher(const std::set<absl::string_view> &dic);
// Finds the longest string in `dic` that is a prefix of `w`.
// Returns the UTF-8 byte length of the matched string.
// `found` is set if a prefix match exists.
// If no entry is found, consumes one Unicode character.
int PrefixMatch(absl::string_view w, bool *found = nullptr) const;
// Replaces entries in `w` with `out`.
std::string GlobalReplace(absl::string_view w, absl::string_view out) const;
private:
std::unique_ptr<Darts::DoubleArray> trie_;
};
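// Illustrative sketch of the PrefixMatch contract documented above. It scans
// a std::set<std::string> linearly rather than using a trie, so it shows the
// behavior only, not the implementation. NaivePrefixMatch and OneCharLen are
// hypothetical names for this example, and the input is assumed to be valid
// UTF-8.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <set>
#include <string>

// Byte length of the UTF-8 character whose lead byte is `b`.
inline size_t OneCharLen(unsigned char b) {
  if (b < 0x80) return 1;
  if ((b >> 5) == 0x06) return 2;
  if ((b >> 4) == 0x0E) return 3;
  if ((b >> 3) == 0x1E) return 4;
  return 1;  // Not a valid lead byte: consume a single byte.
}

// Returns the byte length of the longest entry of `dic` that is a prefix of
// `w`. When nothing matches, returns the length of the first character of
// `w`, so a caller can always advance through the input.
int NaivePrefixMatch(const std::set<std::string> &dic, const std::string &w,
                     bool *found = nullptr) {
  size_t best = 0;
  for (const std::string &entry : dic) {
    if (entry.size() > best && w.compare(0, entry.size(), entry) == 0) {
      best = entry.size();
    }
  }
  if (found != nullptr) *found = best > 0;
  if (best > 0) return static_cast<int>(best);
  if (w.empty()) return 0;
  const size_t len = OneCharLen(static_cast<unsigned char>(w[0]));
  return static_cast<int>(std::min(len, w.size()));
}
```

// With the dictionary {"ab", "abc", "x"}, the query "abcd" matches "abc" and
// yields 3. A query with no matching prefix yields the byte length of its
// first character and leaves `found` false.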
// Normalizer implements a simple text normalizer with
// user-defined string-to-string rules and leftmost-longest
// matching. The rules of Normalizer are built with the
// Builder::CompileCharsMap() method. Pre-compiled rules are
// also available via the Builder::GetPrecompiledCharsMap() method.
//
// The motivation of Normalizer is to provide a flexible, user-customizable,
// and self-contained normalizer. All the normalization logic is
// encoded in the model proto, which allows us to define language- or
// task-dependent normalization rules without breaking the default rules.
class Normalizer {
public:
// Instantiates Normalizer with |spec|.
// |spec| should not be deleted until Normalizer is destroyed.
explicit Normalizer(const NormalizerSpec &spec);
Normalizer(const NormalizerSpec &spec, const TrainerSpec &trainer_spec);
virtual ~Normalizer();
virtual void SetPrefixMatcher(const PrefixMatcher *matcher) {
matcher_ = matcher;
}
// Returns Status.
// The Normalize function is valid only when the status is OK.
virtual util::Status status() const { return status_; }
// Normalizes a plain UTF-8 string into an internal representation for
// the SentencePiece model. |norm_to_orig| stores the byte alignment from
// the normalized string to the original input.
// This function can perform the following normalizations:
// - Character normalization
// (NFKC, full-width to half-width conversion, etc.).
// - Adds a prefix space.
// - Replaces a space with a meta symbol.
// - Removes leading, trailing, and other redundant spaces.
virtual util::Status Normalize(absl::string_view input,
std::string *normalized,
std::vector