pax_global_header 0000666 0000000 0000000 00000000064 13556236233 0014522 g ustar 00root root 0000000 0000000 52 comment=50990da5cc02675ed8546a47561bf98d630bb9f9
jaro_winkler-1.5.4/ 0000775 0000000 0000000 00000000000 13556236233 0014217 5 ustar 00root root 0000000 0000000 jaro_winkler-1.5.4/.gitignore 0000664 0000000 0000000 00000000174 13556236233 0016211 0 ustar 00root root 0000000 0000000 /.bundle/
/.yardoc
/Gemfile.lock
/_yardoc/
/coverage/
/doc/
/pkg/
/spec/reports/
/tmp/
*.bundle
*.so
*.o
*.a
mkmf.log
/tags
jaro_winkler-1.5.4/.rspec 0000664 0000000 0000000 00000000036 13556236233 0015333 0 ustar 00root root 0000000 0000000 --color
--require spec_helper
jaro_winkler-1.5.4/.travis.yml 0000664 0000000 0000000 00000000711 13556236233 0016327 0 ustar 00root root 0000000 0000000 language: ruby
before_install:
- gem install -v '1.17.3' bundler
install: bundle _1.17.3_ install
bundler_args: --jobs 3 --retry 3 --without benchmark
os:
- linux
- osx
rvm:
- 2.5.0
- 2.4.3
- 2.3.5
- 2.2.9
- 2.1.10
- 2.0.0
- 1.9.3
matrix:
  exclude:
  - rvm: 2.0.0
    os: osx
  - rvm: 1.9.3
    os: osx
  include:
  - rvm: 2.0.0
    os: osx
    osx_image: xcode7.3
  - rvm: 1.9.3
    os: osx
    osx_image: xcode7.3
jaro_winkler-1.5.4/CHANGELOG.md 0000664 0000000 0000000 00000004412 13556236233 0016031 0 ustar 00root root 0000000 0000000 ## [1.5.3](https://github.com/tonytonyjan/jaro_winkler/compare/v1.5.2...v1.5.3) (2019-06-18)
* Fall back to pure ruby implementation on LoadError ([49f811e](https://github.com/tonytonyjan/jaro_winkler/commit/49f811e))
* Rename Rake tasks for test ([42e0a36](https://github.com/tonytonyjan/jaro_winkler/commit/42e0a36))
## [1.5.2](https://github.com/tonytonyjan/jaro_winkler/compare/v1.5.1...v1.5.2) (2019-01-04)
### Bug Fixes
* raises TypeError when input type is not string ([c146491](https://github.com/tonytonyjan/jaro_winkler/commit/c146491)), closes [#24](https://github.com/tonytonyjan/jaro_winkler/issues/24)
* **memory:** make sure codepoints will be allocated/freed after rb_raise to prevent memory leak ([fe9d784](https://github.com/tonytonyjan/jaro_winkler/commit/fe9d784)), closes [#20](https://github.com/tonytonyjan/jaro_winkler/issues/20)
## [1.5.1](https://github.com/tonytonyjan/jaro_winkler/compare/v1.5.0...v1.5.1) (2018-06-06)
# [1.5.0](https://github.com/tonytonyjan/jaro_winkler/compare/v1.4.0...v1.5.0) (2017-10-02)
### Bug Fixes
* free codepoints before returning to prevent memory leak ([8babd4f](https://github.com/tonytonyjan/jaro_winkler/commit/8babd4f))
* remove module functions from JaroWinkler ([af249d5](https://github.com/tonytonyjan/jaro_winkler/commit/af249d5))
### Features
* support encodings other than utf-8 ([fe72ab4](https://github.com/tonytonyjan/jaro_winkler/commit/fe72ab4)), closes [#7](https://github.com/tonytonyjan/jaro_winkler/issues/7)
* support rubinius ([27090ff](https://github.com/tonytonyjan/jaro_winkler/commit/27090ff))
### Performance Improvements
The C implementation is 25% faster than that of v1.4.0.
* optimize single byte codepoints computation ([e0cdd51](https://github.com/tonytonyjan/jaro_winkler/commit/e0cdd51))
### BREAKING CHANGES
* JaroWinkler no longer supports being used as a mixin; use class methods instead, e.g.
`JaroWinkler.distance`
# [1.4.0](https://github.com/tonytonyjan/jaro_winkler/compare/v1.3.7...v1.4.0) (2015-12-12)
* The pure Ruby version is about 2.5 times faster than in 1.3.7.
* Add `JaroWinkler.jaro_distance` for anyone who wants to use the Jaro distance instead of the Jaro-Winkler distance.
* Unify the algorithms of the C extension and the pure Ruby version; in particular, the pure Ruby version was rewritten.
jaro_winkler-1.5.4/Gemfile 0000664 0000000 0000000 00000000172 13556236233 0015512 0 ustar 00root root 0000000 0000000 source 'https://rubygems.org'
gemspec
group :benchmark do
gem 'fuzzy-string-match'
gem 'hotwater'
gem 'amatch'
end jaro_winkler-1.5.4/LICENSE.txt 0000664 0000000 0000000 00000002055 13556236233 0016044 0 ustar 00root root 0000000 0000000 Copyright (c) 2014 Jian Weihang
MIT License
Permission is hereby granted, free of charge, to any person obtaining
a copy of this software and associated documentation files (the
"Software"), to deal in the Software without restriction, including
without limitation the rights to use, copy, modify, merge, publish,
distribute, sublicense, and/or sell copies of the Software, and to
permit persons to whom the Software is furnished to do so, subject to
the following conditions:
The above copyright notice and this permission notice shall be
included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
jaro_winkler-1.5.4/README.md 0000664 0000000 0000000 00000015504 13556236233 0015503 0 ustar 00root root 0000000 0000000 [](https://travis-ci.org/tonytonyjan/jaro_winkler)
[jaro_winkler](https://rubygems.org/gems/jaro_winkler) is an implementation of the [Jaro-Winkler distance](http://en.wikipedia.org/wiki/Jaro%E2%80%93Winkler_distance) algorithm. It is written as a C extension and falls back to a pure Ruby version on platforms other than MRI/KRI, such as JRuby or Rubinius. **Both the C and Ruby implementations support any kind of string encoding, such as UTF-8, EUC-JP, Big5, etc.**
# Installation
```
gem install jaro_winkler
```
# Usage
```ruby
require 'jaro_winkler'
# Jaro Winkler Distance
JaroWinkler.distance "MARTHA", "MARHTA"
# => 0.9611
JaroWinkler.distance "MARTHA", "marhta", ignore_case: true
# => 0.9611
JaroWinkler.distance "MARTHA", "MARHTA", weight: 0.2
# => 0.9778
# Jaro Distance
JaroWinkler.jaro_distance "MARTHA", "MARHTA"
# => 0.9444444444444445
```
There is no `JaroWinkler.jaro_winkler_distance`; the name is tediously long.
## Options
Name | Type | Default | Note
----------- | ------ | ------- | ------------------------------------------------------------------------------------------------------------
ignore_case | boolean | false | All lower case characters are converted to upper case prior to the comparison.
weight | number | 0.1 | A constant scaling factor for how much the score is adjusted upwards for having common prefixes.
threshold | number | 0.7 | The prefix bonus is only added when the compared strings have a Jaro distance above the threshold.
adj_table | boolean | false | The option is used to give partial credit for characters that may be errors due to known phonetic or character recognition errors. A typical example is to match the letter "O" with the number "0".
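To make the `weight` and `threshold` options concrete, here is a minimal sketch of how the prefix bonus is applied on top of a Jaro distance. `apply_prefix_bonus` is a hypothetical helper written for this README, not part of the gem's API, and it raises a plain `ArgumentError` where the gem has its own error class. The prefix is capped at 4 characters, as in the C source.

```ruby
# Hypothetical helper for illustration only; not part of the gem's API.
def apply_prefix_bonus(jaro, str1, str2, weight: 0.1, threshold: 0.7)
  # A weight above 0.25 could push the result past 1.0 (4 * 0.25 = 1).
  raise ArgumentError, 'weight should not exceed 0.25' if weight > 0.25
  # Below the threshold the Jaro distance is returned unchanged.
  return jaro if jaro < threshold

  # Count the common leading characters, capped at 4.
  max_prefix = [4, str1.length, str2.length].min
  prefix = 0
  prefix += 1 while prefix < max_prefix && str1[prefix] == str2[prefix]

  jaro + prefix * weight * (1 - jaro)
end

jaro = 0.9444444444444445 # Jaro distance of "MARTHA" and "MARHTA" (see Usage)
puts apply_prefix_bonus(jaro, 'MARTHA', 'MARHTA').round(4)              # 0.9611
puts apply_prefix_bonus(jaro, 'MARTHA', 'MARHTA', weight: 0.2).round(4) # 0.9778
```

The two printed values match the `weight` examples in the Usage section: the common prefix `MAR` has length 3, so the bonus is `3 * weight * (1 - jaro)`.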
# Adjusting Table
## Default Table
```
['A', 'E'], ['A', 'I'], ['A', 'O'], ['A', 'U'], ['B', 'V'], ['E', 'I'], ['E', 'O'], ['E', 'U'], ['I', 'O'], ['I', 'U'],
['O', 'U'], ['I', 'Y'], ['E', 'Y'], ['C', 'G'], ['E', 'F'], ['W', 'U'], ['W', 'V'], ['X', 'K'], ['S', 'Z'], ['X', 'S'],
['Q', 'C'], ['U', 'V'], ['M', 'N'], ['L', 'I'], ['Q', 'O'], ['P', 'R'], ['I', 'J'], ['2', 'Z'], ['5', 'S'], ['8', 'B'],
['1', 'I'], ['1', 'L'], ['0', 'O'], ['0', 'Q'], ['C', 'K'], ['G', 'J'], ['E', ' '], ['Y', ' '], ['S', ' ']
```
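A table like this has to be looked up symmetrically: `['0', 'O']` should match both `"0"` against `"O"` and `"O"` against `"0"`. The sketch below shows one simple way to do that in Ruby. The names `SAMPLE_PAIRS`, `SIMILAR_PAIRS` and `similar?` are made up for this example, and the gem itself uses a different data structure internally.

```ruby
require 'set'

# A few pairs copied from the default table; the rest are omitted for brevity.
SAMPLE_PAIRS = [%w[A E], %w[B V], %w[0 O], %w[1 I], %w[5 S]].freeze

# Store each pair in sorted order so that the lookup is symmetric.
SIMILAR_PAIRS = SAMPLE_PAIRS.map(&:sort).to_set.freeze

# Hypothetical helper: true when the two characters are listed as similar.
def similar?(char1, char2)
  SIMILAR_PAIRS.include?([char1, char2].sort)
end

puts similar?('O', '0') # true
puts similar?('0', 'O') # true
puts similar?('A', 'B') # false
```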
## How does it work?
Original Formula:
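$$
d_j = \begin{cases}
0 & \text{if } m = 0 \\
\frac{1}{3}\left(\frac{m}{|s_1|} + \frac{m}{|s_2|} + \frac{m - t}{m}\right) & \text{otherwise}
\end{cases}
$$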
where
- `m` is the number of matching characters.
- `t` is half the number of transpositions.
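The definitions of `m` and `t` can be sketched in plain Ruby as follows. This is a simplified illustration rather than the gem's implementation: `jaro_sketch` is a hypothetical name, and the matching window of `longer_length / 2 - 1` and the integer division used for `t` are taken from the C source in this repository.

```ruby
# Illustrative sketch only; `jaro_sketch` is not part of the gem.
def jaro_sketch(str1, str2)
  short = str1.chars
  long = str2.chars
  short, long = long, short if short.length > long.length
  return 0.0 if short.empty?

  window = [long.length / 2 - 1, 0].max
  short_matched = Array.new(short.length, false)
  long_matched = Array.new(long.length, false)

  # m: characters that agree within the matching window
  m = 0
  short.each_with_index do |char, i|
    left = [i - window, 0].max
    right = [i + window, long.length - 1].min
    (left..right).each do |j|
      next if long_matched[j] || char != long[j]

      short_matched[i] = long_matched[j] = true
      m += 1
      break
    end
  end
  return 0.0 if m.zero?

  # t: half the number of matched characters that appear in a different order
  a = short.each_index.select { |i| short_matched[i] }.map { |i| short[i] }
  b = long.each_index.select { |j| long_matched[j] }.map { |j| long[j] }
  t = a.zip(b).count { |x, y| x != y } / 2

  (m.to_f / short.length + m.to_f / long.length + (m - t).to_f / m) / 3
end

puts jaro_sketch('MARTHA', 'MARHTA').round(4) # 0.9444
puts jaro_sketch('dwayne', 'duane').round(4)  # 0.8222
```

For `"MARTHA"` and `"MARHTA"`, all six characters match (`m = 6`) and `T`/`H` are swapped (`t = 1`), which gives `(1 + 1 + 5/6) / 3`.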
With Adjusting Table:
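$$
d_j = \begin{cases}
0 & \text{if } m = 0 \\
\frac{1}{3}\left(\frac{\frac{s}{10} + m}{|s_1|} + \frac{\frac{s}{10} + m}{|s_2|} + \frac{m - t}{m}\right) & \text{otherwise}
\end{cases}
$$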
where
- `s` is the number of nonmatching but similar characters.
# Why This?
There is another similar gem named [fuzzy-string-match](https://github.com/kiyoka/fuzzy-string-match), which also provides both C and Ruby versions.
I reinvented this wheel for three reasons: naming in `fuzzy-string-match` such as `getDistance` breaks Ruby convention, it contains some odd code like `a1 = s1.split( // )` (`s1.chars` would be better), and it is buggy (see the tables below).
# Compare with other gems
| | jaro_winkler | fuzzystringmatch | hotwater | amatch |
|-----------------|--------------|------------------|----------|---------|
| Encoding Support| **Yes** | Pure Ruby only | No | No |
| Windows Support | **Yes** | ? | No | **Yes** |
| Adjusting Table | **Yes** | No | No | No |
| Native | **Yes** | **Yes** | **Yes** | **Yes** |
| Pure Ruby | **Yes** | **Yes** | No | No |
| Speed | **1st** | 3rd | 2nd | 4th |
The table below compares the accuracy of each gem. Results that differ from the original implementation are shown in bold:
str_1 | str_2 | origin | jaro_winkler | fuzzystringmatch | hotwater | amatch
--- | --- | --- | --- | --- | --- | ---
"henka" | "henkan" | 0.9667 | 0.9667 | **0.9722** | 0.9667 | **0.9444**
"al" | "al" | 1.0 | 1.0 | 1.0 | 1.0 | 1.0
"martha" | "marhta" | 0.9611 | 0.9611 | 0.9611 | 0.9611 | **0.9444**
"jones" | "johnson" | 0.8324 | 0.8324 | 0.8324 | 0.8324 | **0.7905**
"abcvwxyz" | "cabvwxyz" | 0.9583 | 0.9583 | 0.9583 | 0.9583 | 0.9583
"dwayne" | "duane" | 0.84 | 0.84 | 0.84 | 0.84 | **0.8222**
"dixon" | "dicksonx" | 0.8133 | 0.8133 | 0.8133 | 0.8133 | **0.7667**
"fvie" | "ten" | 0.0 | 0.0 | 0.0 | 0.0 | 0.0
- The "origin" result is from the [original C implementation by the author of the algorithm](http://web.archive.org/web/20100227020019/http://www.census.gov/geo/msb/stand/strcmp.c).
- Test data are borrowed from [fuzzy-string-match's rspec file](https://github.com/kiyoka/fuzzy-string-match/blob/master/test/basic_pure_spec.rb).
# Benchmark
```
$ bundle exec rake benchmark
ruby 2.4.1p111 (2017-03-22 revision 58053) [x86_64-darwin16]
# C Extension
Rehearsal --------------------------------------------------------------
jaro_winkler (8c16e09) 0.240000 0.000000 0.240000 ( 0.241347)
fuzzy-string-match (1.0.1) 0.400000 0.010000 0.410000 ( 0.403673)
hotwater (0.1.2) 0.250000 0.000000 0.250000 ( 0.254503)
amatch (0.4.0) 0.870000 0.000000 0.870000 ( 0.875930)
----------------------------------------------------- total: 1.770000sec
user system total real
jaro_winkler (8c16e09) 0.230000 0.000000 0.230000 ( 0.236921)
fuzzy-string-match (1.0.1) 0.380000 0.000000 0.380000 ( 0.381942)
hotwater (0.1.2) 0.250000 0.000000 0.250000 ( 0.254977)
amatch (0.4.0) 0.860000 0.000000 0.860000 ( 0.861207)
# Pure Ruby
Rehearsal --------------------------------------------------------------
jaro_winkler (8c16e09) 0.440000 0.000000 0.440000 ( 0.438470)
fuzzy-string-match (1.0.1) 0.860000 0.000000 0.860000 ( 0.862850)
----------------------------------------------------- total: 1.300000sec
user system total real
jaro_winkler (8c16e09) 0.440000 0.000000 0.440000 ( 0.439237)
fuzzy-string-match (1.0.1) 0.910000 0.010000 0.920000 ( 0.920259)
```
# Todo
- Custom adjusting word table.
jaro_winkler-1.5.4/Rakefile 0000664 0000000 0000000 00000005721 13556236233 0015671 0 ustar 00root root 0000000 0000000 require 'rubygems/package_task'
require 'rake/extensiontask'
require 'rake/testtask'
task default: :test
task test: %w[test:pure_ruby test:compiled]
task benchmark: %w[benchmark:native benchmark:pure]
task :print_ruby_version do
print "#{RUBY_DESCRIPTION}\n\n"
end
namespace :benchmark do
task native: :print_ruby_version do |t, args|
puts '# C Extension'
load File.expand_path("../benchmark/native.rb", __FILE__)
puts
end
task pure: :print_ruby_version do |t, args|
puts '# Pure Ruby'
load File.expand_path("../benchmark/pure.rb", __FILE__)
puts
end
task :measure do
tags = ENV['TAGS'] ? ENV['TAGS'].split(',') : `git tag --list`.split.select { |v| v.match? /\Av1\.[1-9]\.\d\z/ }
puts 'version,label,utime,stime,cutime,cstime,real'
tags.each do |tag|
sh("git checkout -f #{tag} 1>&2")
sh('git checkout master -- benchmark 1>&2')
sh('bundle exec rake clobber compile 1>&2')
sh("ruby #{File.expand_path("../benchmark/measure.rb", __FILE__)}")
end
end
end
task compare: :compile do
require 'jaro_winkler'
require 'fuzzystringmatch'
require 'hotwater'
require 'amatch'
@ary = [['henka', 'henkan'], ['al', 'al'], ['martha', 'marhta'], ['jones', 'johnson'], ['abcvwxyz', 'cabvwxyz'], ['dwayne', 'duane'], ['dixon', 'dicksonx'], ['fvie', 'ten'], ['San Francisco', 'Santa Monica']]
table = []
table << %w[str_1 str_2 jaro_winkler fuzzystringmatch hotwater amatch]
table << %w[--- --- --- --- --- ---]
jarow = FuzzyStringMatch::JaroWinkler.create(:native)
@ary.each do |str_1, str_2|
table << ["\"#{str_1}\"", "\"#{str_2}\"", JaroWinkler.distance(str_1, str_2).round(4), jarow.getDistance(str_1, str_2).round(4), Hotwater.jaro_winkler_distance(str_1, str_2).round(4), Amatch::Jaro.new(str_1).match(str_2).round(4)]
end
# Pad every cell to the width of the widest entry in its column.
col_len = table.transpose.map { |column| column.map { |cell| cell.to_s.length }.max }
table.each do |row|
  row.each_with_index do |col, i|
    row[i] = "%-#{col_len[i]}s" % col.to_s
  end
end
table.each { |row| puts row.join(' | ') }
end
if RUBY_ENGINE == 'ruby'
Rake::ExtensionTask.new 'jaro_winkler_ext' do |ext|
ext.lib_dir = 'lib/jaro_winkler'
ext.ext_dir = 'ext/jaro_winkler'
end
else
task :compile do
puts 'Can not compile C extension, fallback to pure Ruby version.'
end
end
namespace :test do
Rake::TestTask.new(:compiled => :compile) do |t|
t.libs << 'test'
t.test_files = FileList['test/test_jaro_winkler.rb']
t.verbose = true
end
Rake::TestTask.new(:pure_ruby) do |t|
t.libs << 'test'
t.test_files = FileList['test/test_pure_ruby.rb']
t.verbose = true
end
end
%w[jaro_winkler jaro_winkler.java]
.map { |name| Gem::Specification.load(File.expand_path("../#{name}.gemspec", __FILE__)) }
.each { |spec| Gem::PackageTask.new(spec).define }
task 'CHANGELOG.md' do
sh 'conventional-changelog -p angular -i CHANGELOG.md -s'
end
jaro_winkler-1.5.4/benchmark/ 0000775 0000000 0000000 00000000000 13556236233 0016151 5 ustar 00root root 0000000 0000000 jaro_winkler-1.5.4/benchmark/env.rb 0000664 0000000 0000000 00000000404 13556236233 0017264 0 ustar 00root root 0000000 0000000 # frozen_string_literal: true
$LOAD_PATH << File.expand_path('../../lib', __FILE__)
require 'bundler'
Bundler.setup(:benchmark)
require File.expand_path('../samples', __FILE__)
def gem_name_with_version(gem)
"#{gem} (#{Gem.loaded_specs[gem].version})"
end
jaro_winkler-1.5.4/benchmark/measure.rb 0000664 0000000 0000000 00000002567 13556236233 0020151 0 ustar 00root root 0000000 0000000 # frozen_string_literal: true
require File.expand_path('../samples', __FILE__)
gem 'jaro_winkler', ENV['JARO_WINKLER_VERSION'] || ARGV[0] || raise('missing ENV["JARO_WINKLER_VERSION"]')
require 'benchmark'
require 'csv'
csv = CSV.new($stdout)
n = 100_000
version = Gem::Version.new(Gem.loaded_specs["jaro_winkler"].version)
if version >= Gem::Version.new('1.1.0') && version < Gem::Version.new('1.2.4')
require 'jaro_winkler.bundle'
else
require 'jaro_winkler'
end
jobs = {
ascii: -> { n.times { SAMPLES[:ascii].each { |str1, str2| JaroWinkler.jaro_winkler_distance(str1, str2) } } }
}
if version >= Gem::Version.new('1.1.0')
jobs[:ascii] = -> { n.times { SAMPLES[:ascii].each { |str1, str2| JaroWinkler.c_distance(str1, str2) } } }
end
if version >= Gem::Version.new('1.2.0')
jobs[:utf8] = -> { n.times { SAMPLES[:utf8].each { |str1, str2| JaroWinkler.c_distance(str1, str2) } } }
end
if version >= Gem::Version.new('1.4.0')
jobs[:ascii] = -> { n.times { SAMPLES[:ascii].each { |str1, str2| JaroWinkler.distance(str1, str2) } } }
jobs[:utf8] = -> { n.times { SAMPLES[:utf8].each { |str1, str2| JaroWinkler.distance(str1, str2) } } }
end
# rehearsal
jobs.each { |label, job| Benchmark.measure(label, &job) }
# take
jobs.each do |label, job|
GC.start
tms = Benchmark.measure(label, &job)
# version,label,utime,stime,cutime,cstime,real
csv << [version, *tms.to_a]
end
jaro_winkler-1.5.4/benchmark/native.rb 0000664 0000000 0000000 00000001606 13556236233 0017767 0 ustar 00root root 0000000 0000000 # frozen_string_literal: true
require File.expand_path('../env', __FILE__)
require 'benchmark'
require 'jaro_winkler/jaro_winkler_ext'
require 'fuzzystringmatch'
require 'hotwater'
require 'amatch'
n = 100_000
Benchmark.bmbm do |x|
x.report "jaro_winkler (#{`git rev-parse --short HEAD`.chop!})" do
n.times { SAMPLES[:ascii].each { |str1, str2| JaroWinkler.distance(str1, str2) } }
end
x.report gem_name_with_version('fuzzy-string-match') do
jarow = FuzzyStringMatch::JaroWinkler.create(:native)
n.times { SAMPLES[:ascii].each { |str1, str2| jarow.getDistance(str1, str2) } }
end
x.report gem_name_with_version('hotwater') do
n.times { SAMPLES[:ascii].each { |str1, str2| Hotwater.jaro_winkler_distance(str1, str2) } }
end
x.report gem_name_with_version('amatch') do
n.times { SAMPLES[:ascii].each { |str1, str2| Amatch::Jaro.new(str1).match(str2) } }
end
end
jaro_winkler-1.5.4/benchmark/pure.rb 0000664 0000000 0000000 00000001072 13556236233 0017451 0 ustar 00root root 0000000 0000000 # frozen_string_literal: true
require File.expand_path('../env', __FILE__)
require 'benchmark'
require 'jaro_winkler/jaro_winkler_pure'
require 'fuzzystringmatch'
n = 10_000
Benchmark.bmbm do |x|
x.report "jaro_winkler (#{`git rev-parse --short HEAD`.chop!})" do
n.times { SAMPLES[:ascii].each { |str1, str2| JaroWinkler.distance(str1, str2) } }
end
x.report gem_name_with_version('fuzzy-string-match') do
jarow = FuzzyStringMatch::JaroWinkler.create(:pure)
n.times { SAMPLES[:ascii].each { |str1, str2| jarow.getDistance(str1, str2) } }
end
end
jaro_winkler-1.5.4/benchmark/samples.rb 0000664 0000000 0000000 00000001031 13556236233 0020135 0 ustar 00root root 0000000 0000000 SAMPLES = {
ascii: [
%w[al al], %w[martha marhta], %w[jones johnson], %w[abcvwxyz cabvwxyz],
%w[dwayne duane], %w[dixon dicksonx], %w[fvie ten]
].freeze,
utf8: [
%w[馬英九 馬英丸], %w[蔡英文 蔡中文], %w[簡煒航 簡偉航], %w[焦玟綾 焦紋綾],
%w[眼球中央電視台 眼球中英電視台], %w[床前明月光 床前日月光],
%w[海水退了就知道誰沒穿褲子 海水退了就知道誰沒穿襪子],
%w[阿里山的姑娘美如水 阿里山的姑娘沒乳水]
].freeze
}.freeze
jaro_winkler-1.5.4/bin/ 0000775 0000000 0000000 00000000000 13556236233 0014767 5 ustar 00root root 0000000 0000000 jaro_winkler-1.5.4/bin/measure 0000775 0000000 0000000 00000000377 13556236233 0016365 0 ustar 00root root 0000000 0000000 #!/bin/sh
echo version,label,utime,stime,cutime,cstime,real
gem search -ear jaro_winkler \
| grep -o '\((.*)\)$' \
| tr -d '() ' \
| tr ',' "\n" \
| grep -o '[0-9]\.[0-9]\.[0-9]' \
| sort \
| xargs -I{} ruby "`dirname $0`"/../benchmark/measure.rb '{}'
jaro_winkler-1.5.4/ext/ 0000775 0000000 0000000 00000000000 13556236233 0015017 5 ustar 00root root 0000000 0000000 jaro_winkler-1.5.4/ext/jaro_winkler/ 0000775 0000000 0000000 00000000000 13556236233 0017505 5 ustar 00root root 0000000 0000000 jaro_winkler-1.5.4/ext/jaro_winkler/adj_matrix.c 0000664 0000000 0000000 00000006041 13556236233 0021774 0 ustar 00root root 0000000 0000000 #include "adj_matrix.h"
#include "codepoints.h"
#include "ruby.h"
const char *DEFAULT_ADJ_TABLE[] = {
"A", "E", "A", "I", "A", "O", "A", "U", "B", "V", "E", "I", "E",
"O", "E", "U", "I", "O", "I", "U", "O", "U", "I", "Y", "E", "Y",
"C", "G", "E", "F", "W", "U", "W", "V", "X", "K", "S", "Z", "X",
"S", "Q", "C", "U", "V", "M", "N", "L", "I", "Q", "O", "P", "R",
"I", "J", "2", "Z", "5", "S", "8", "B", "1", "I", "1", "L", "0",
"O", "0", "Q", "C", "K", "G", "J", "E", " ", "Y", " ", "S", " "};
void node_free(Node *head);
AdjMatrix *adj_matrix_new(uint32_t length) {
AdjMatrix *matrix = malloc(sizeof(AdjMatrix));
matrix->length = length == 0 ? ADJ_MATRIX_DEFAULT_LENGTH : length;
matrix->table = malloc(matrix->length * sizeof(Node **));
for (size_t i = 0; i < matrix->length; i++) {
matrix->table[i] = malloc(matrix->length * sizeof(Node *));
for (size_t j = 0; j < matrix->length; j++)
matrix->table[i][j] = NULL;
}
return matrix;
}
void adj_matrix_add(AdjMatrix *matrix, uint64_t x, uint64_t y) {
  // Hash into the matrix's own dimensions rather than the default length,
  // so a matrix created with a custom length is never indexed out of bounds.
  uint32_t h1 = st_hash(&x, sizeof(x), ADJ_MATRIX_SEED) % matrix->length,
           h2 = st_hash(&y, sizeof(y), ADJ_MATRIX_SEED) % matrix->length;
  Node *new_node = malloc(sizeof(Node));
  // Store the code points themselves (not their hashes) so that lookups can
  // tell colliding pairs apart.
  new_node->x = x;
  new_node->y = y;
  new_node->next = NULL;
  if (matrix->table[h1][h2] == NULL) {
    matrix->table[h1][h2] = matrix->table[h2][h1] = new_node;
  } else {
    Node *previous = NULL;
    for (Node *i = matrix->table[h1][h2]; i != NULL; i = i->next)
      previous = i;
    previous->next = new_node;
  }
}
char adj_matrix_find(AdjMatrix *matrix, uint64_t x, uint64_t y) {
  uint32_t h1 = st_hash(&x, sizeof(x), ADJ_MATRIX_SEED) % matrix->length,
           h2 = st_hash(&y, sizeof(y), ADJ_MATRIX_SEED) % matrix->length;
  // Walk the bucket and compare the stored code points in either order.
  for (Node *i = matrix->table[h1][h2]; i != NULL; i = i->next)
    if ((i->x == x && i->y == y) || (i->x == y && i->y == x))
      return 1;
  return 0;
}
void node_free(Node *head) {
if (head == NULL)
return;
node_free(head->next);
free(head);
}
void adj_matrix_free(AdjMatrix *matrix) {
for (size_t i = 0; i < matrix->length; i++) {
for (size_t j = 0; j < matrix->length; j++)
if (matrix->table[i][j] != NULL) {
node_free(matrix->table[i][j]);
matrix->table[i][j] = matrix->table[j][i] = NULL;
}
free(matrix->table[i]);
}
free(matrix->table);
free(matrix);
}
AdjMatrix *adj_matrix_default() {
static char first_time = 1;
static AdjMatrix *ret_matrix;
if (first_time) {
ret_matrix = adj_matrix_new(ADJ_MATRIX_DEFAULT_LENGTH);
size_t length = sizeof(DEFAULT_ADJ_TABLE) / sizeof(char *);
for (size_t i = 0; i < length; i += 2) {
uint64_t code_1, code_2;
code_1 = *DEFAULT_ADJ_TABLE[i] & 0xff;
code_2 = *DEFAULT_ADJ_TABLE[i + 1] & 0xff;
adj_matrix_add(ret_matrix, code_1, code_2);
}
first_time = 0;
}
return ret_matrix;
}
jaro_winkler-1.5.4/ext/jaro_winkler/adj_matrix.h 0000664 0000000 0000000 00000000750 13556236233 0022002 0 ustar 00root root 0000000 0000000 #pragma once
#include "stdint.h"
#define ADJ_MATRIX_DEFAULT_LENGTH 958
#define ADJ_MATRIX_SEED 9527
typedef struct _node {
struct _node *next;
uint64_t x, y;
} Node;
typedef struct {
Node ***table;
uint32_t length;
} AdjMatrix;
AdjMatrix *adj_matrix_new(uint32_t length);
void adj_matrix_add(AdjMatrix *matrix, uint64_t x, uint64_t y);
char adj_matrix_find(AdjMatrix *matrix, uint64_t x, uint64_t y);
void adj_matrix_free(AdjMatrix *matrix);
AdjMatrix *adj_matrix_default();
jaro_winkler-1.5.4/ext/jaro_winkler/codepoints.c 0000664 0000000 0000000 00000003304 13556236233 0022020 0 ustar 00root root 0000000 0000000 #include "codepoints.h"
#include "ruby.h"
#include "ruby/encoding.h"
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
// this function is copied from string.c
static inline int single_byte_optimizable(VALUE str) {
rb_encoding *enc;
/* Conservative. It may be ENC_CODERANGE_UNKNOWN. */
if (ENC_CODERANGE(str) == ENC_CODERANGE_7BIT)
return 1;
enc = rb_enc_get(str);
if (rb_enc_mbmaxlen(enc) == 1)
return 1;
/* Conservative. Possibly single byte.
* "\xa1" in Shift_JIS for example. */
return 0;
}
void codepoints_init(CodePoints *codepoints, VALUE str) {
size_t i, length;
int32_t n;
uint32_t c;
const char *ptr, *end;
rb_encoding *enc;
if (single_byte_optimizable(str)) {
length = RSTRING_LEN(str);
ptr = RSTRING_PTR(str);
codepoints->data = malloc(length * sizeof(*codepoints->data));
for (i = 0, codepoints->length = 0; i < length; i++, codepoints->length++)
codepoints->data[i] = ptr[i] & 0xff;
} else {
codepoints->length = 0;
codepoints->size = 32;
codepoints->data = malloc(codepoints->size * sizeof(*codepoints->data));
str = rb_str_new_frozen(str);
ptr = RSTRING_PTR(str);
end = RSTRING_END(str);
enc = rb_enc_get(str);
while (ptr < end) {
c = rb_enc_codepoint_len(ptr, end, &n, enc);
if (codepoints->length == codepoints->size) {
codepoints->size *= 2;
codepoints->data = realloc(codepoints->data, sizeof(*codepoints->data) *
codepoints->size);
}
codepoints->data[codepoints->length++] = c;
ptr += n;
}
RB_GC_GUARD(str);
}
}
void codepoints_free(CodePoints *codepoints) { free(codepoints->data); }
jaro_winkler-1.5.4/ext/jaro_winkler/codepoints.h 0000664 0000000 0000000 00000000355 13556236233 0022030 0 ustar 00root root 0000000 0000000 #pragma once
#include "ruby.h"
#include <stddef.h>
#include <stdint.h>
typedef struct {
uint32_t *data;
size_t length;
size_t size;
} CodePoints;
void codepoints_init(CodePoints *, VALUE str);
void codepoints_free(CodePoints *);
jaro_winkler-1.5.4/ext/jaro_winkler/extconf.rb 0000664 0000000 0000000 00000000166 13556236233 0021503 0 ustar 00root root 0000000 0000000 # frozen_string_literal: true
require 'mkmf'
$CFLAGS << ' -std=c99 '
create_makefile('jaro_winkler/jaro_winkler_ext') jaro_winkler-1.5.4/ext/jaro_winkler/jaro.c 0000664 0000000 0000000 00000007315 13556236233 0020612 0 ustar 00root root 0000000 0000000 #include "jaro.h"
#include "adj_matrix.h"
#include "codepoints.h"
#include <ctype.h>
#include <stdint.h>
#include <string.h>
#define DEFAULT_WEIGHT 0.1
#define DEFAULT_THRESHOLD 0.7
#define SWAP(x, y) \
do { \
__typeof__(x) SWAP = x; \
x = y; \
y = SWAP; \
} while (0)
const Options DEFAULT_OPTIONS = {.weight = DEFAULT_WEIGHT,
.threshold = DEFAULT_THRESHOLD,
.ignore_case = 0,
.adj_table = 0};
double jaro_distance_from_codes(uint32_t *codepoints1, size_t len1,
uint32_t *codepoints2, size_t len2,
Options *opt) {
if (!len1 || !len2)
return 0.0;
if (len1 > len2) {
SWAP(codepoints1, codepoints2);
SWAP(len1, len2);
}
if (opt->ignore_case) {
for (size_t i = 0; i < len1; i++)
codepoints1[i] = tolower(codepoints1[i]);
for (size_t i = 0; i < len2; i++)
codepoints2[i] = tolower(codepoints2[i]);
}
int32_t window_size = (int32_t)len2 / 2 - 1;
if (window_size < 0)
window_size = 0;
char short_codes_flag[len1];
char long_codes_flag[len2];
memset(short_codes_flag, 0, len1);
memset(long_codes_flag, 0, len2);
// count number of matching characters
size_t match_count = 0;
for (size_t i = 0; i < len1; i++) {
size_t left = (i >= (size_t)window_size) ? i - window_size : 0;
size_t right =
(i + window_size <= len2 - 1) ? (i + window_size) : (len2 - 1);
if (right > len2 - 1)
right = len2 - 1;
for (size_t j = left; j <= right; j++) {
if (!long_codes_flag[j] && codepoints1[i] == codepoints2[j]) {
short_codes_flag[i] = long_codes_flag[j] = 1;
match_count++;
break;
}
}
}
if (!match_count)
return 0.0;
// count number of transpositions
size_t transposition_count = 0, j = 0, k = 0;
for (size_t i = 0; i < len1; i++) {
if (short_codes_flag[i]) {
for (j = k; j < len2; j++) {
if (long_codes_flag[j]) {
k = j + 1;
break;
}
}
if (codepoints1[i] != codepoints2[j])
transposition_count++;
}
}
// count similarities in nonmatched characters
size_t similar_count = 0;
if (opt->adj_table && len1 > match_count)
for (size_t i = 0; i < len1; i++)
if (!short_codes_flag[i])
for (size_t j = 0; j < len2; j++)
if (!long_codes_flag[j])
if (adj_matrix_find(adj_matrix_default(), codepoints1[i],
codepoints2[j])) {
similar_count += 3;
break;
}
double m = (double)match_count;
double t = (double)(transposition_count / 2);
if (opt->adj_table)
m = similar_count / 10.0 + m;
return (m / len1 + m / len2 + (m - t) / m) / 3;
}
double jaro_winkler_distance_from_codes(uint32_t *codepoints1, size_t len1,
uint32_t *codepoints2, size_t len2,
Options *opt) {
double jaro_distance =
jaro_distance_from_codes(codepoints1, len1, codepoints2, len2, opt);
if (jaro_distance < opt->threshold)
return jaro_distance;
else {
size_t prefix = 0;
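// Bound the prefix scan by the shorter string so that neither array is
// read past its end (len1 may be larger than len2 here).
size_t min_len = len1 < len2 ? len1 : len2;
size_t max_4 = min_len > 4 ? 4 : min_len;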
for (prefix = 0;
prefix < max_4 && codepoints1[prefix] == codepoints2[prefix]; prefix++)
;
return jaro_distance + prefix * opt->weight * (1 - jaro_distance);
}
}
jaro_winkler-1.5.4/ext/jaro_winkler/jaro.h 0000664 0000000 0000000 00000001026 13556236233 0020610 0 ustar 00root root 0000000 0000000 #pragma once
#include <stddef.h>
#include <stdint.h>
typedef struct {
double weight, threshold;
char ignore_case, adj_table;
} Options;
extern const Options DEFAULT_OPTIONS;
double jaro_distance_from_codes(uint32_t *codepoints1, size_t len1,
uint32_t *codepoints2, size_t len2, Options *);
double jaro_winkler_distance_from_codes(uint32_t *codepoints1, size_t len1,
uint32_t *codepoints2, size_t len2,
Options *);
jaro_winkler-1.5.4/ext/jaro_winkler/jaro_winkler.c 0000664 0000000 0000000 00000005532 13556236233 0022344 0 ustar 00root root 0000000 0000000 #include "codepoints.h"
#include "jaro.h"
#include "ruby.h"
VALUE rb_mJaroWinkler, rb_eError, rb_eInvalidWeightError;
VALUE rb_jaro_winkler_distance(size_t argc, VALUE *argv, VALUE self);
VALUE rb_jaro_distance(size_t argc, VALUE *argv, VALUE self);
VALUE distance(size_t argc, VALUE *argv, VALUE self,
double (*distance_fn)(uint32_t *codepoints1, size_t len1,
uint32_t *codepoints2, size_t len2,
Options *));
void Init_jaro_winkler_ext(void) {
rb_mJaroWinkler = rb_define_module("JaroWinkler");
rb_eError = rb_define_class_under(rb_mJaroWinkler, "Error", rb_eRuntimeError);
rb_eInvalidWeightError =
rb_define_class_under(rb_mJaroWinkler, "InvalidWeightError", rb_eError);
rb_define_singleton_method(rb_mJaroWinkler, "distance",
rb_jaro_winkler_distance, -1);
rb_define_singleton_method(rb_mJaroWinkler, "jaro_distance", rb_jaro_distance,
-1);
}
VALUE distance(size_t argc, VALUE *argv, VALUE self,
double (*distance_fn)(uint32_t *codepoints1, size_t len1,
uint32_t *codepoints2, size_t len2,
Options *)) {
VALUE s1, s2, opt;
rb_scan_args((int32_t)argc, argv, "2:", &s1, &s2, &opt);
Check_Type(s1, T_STRING);
Check_Type(s2, T_STRING);
Options c_opt = DEFAULT_OPTIONS;
if (TYPE(opt) == T_HASH) {
VALUE weight = rb_hash_aref(opt, ID2SYM(rb_intern("weight"))),
threshold = rb_hash_aref(opt, ID2SYM(rb_intern("threshold"))),
ignore_case = rb_hash_aref(opt, ID2SYM(rb_intern("ignore_case"))),
adj_table = rb_hash_aref(opt, ID2SYM(rb_intern("adj_table")));
if (!NIL_P(weight))
c_opt.weight = NUM2DBL(weight);
if (c_opt.weight > 0.25)
rb_raise(rb_eInvalidWeightError, "Scaling factor should not exceed 0.25, "
"otherwise the distance can become "
"larger than 1.");
if (!NIL_P(threshold))
c_opt.threshold = NUM2DBL(threshold);
if (!NIL_P(ignore_case))
c_opt.ignore_case =
(TYPE(ignore_case) == T_FALSE || NIL_P(ignore_case)) ? 0 : 1;
if (!NIL_P(adj_table))
c_opt.adj_table =
(TYPE(adj_table) == T_FALSE || NIL_P(adj_table)) ? 0 : 1;
}
CodePoints cp1, cp2;
codepoints_init(&cp1, s1);
codepoints_init(&cp2, s2);
VALUE ret = rb_float_new(
(*distance_fn)(cp1.data, cp1.length, cp2.data, cp2.length, &c_opt));
codepoints_free(&cp1);
codepoints_free(&cp2);
return ret;
}
VALUE rb_jaro_distance(size_t argc, VALUE *argv, VALUE self) {
return distance(argc, argv, self, jaro_distance_from_codes);
}
VALUE rb_jaro_winkler_distance(size_t argc, VALUE *argv, VALUE self) {
return distance(argc, argv, self, jaro_winkler_distance_from_codes);
}
jaro_winkler-1.5.4/jaro_winkler.gemspec 0000664 0000000 0000000 00000002276 13556236233 0020261 0 ustar 00root root 0000000 0000000 # coding: utf-8
lib = File.expand_path('../lib', __FILE__)
$LOAD_PATH.unshift(lib) unless $LOAD_PATH.include?(lib)
require 'jaro_winkler/version'
Gem::Specification.new do |spec|
spec.name = 'jaro_winkler'
spec.version = JaroWinkler::VERSION
spec.authors = ['Jian Weihang']
spec.email = 'tonytonyjan@gmail.com'
spec.extensions = ['ext/jaro_winkler/extconf.rb']
# Adjacent string literals are joined by the trailing backslash. A backslash
# inside a single-quoted string would end up in the text literally.
spec.summary = 'An implementation of the Jaro-Winkler distance algorithm written ' \
               'as a C extension, which supports any kind of string encoding.'
spec.description = 'jaro_winkler is an implementation of the Jaro-Winkler ' \
                   'distance algorithm. It is written as a C extension and falls back to a pure ' \
                   'Ruby version on platforms other than MRI/KRI, such as JRuby or Rubinius. Both ' \
                   'the C and Ruby implementations support any kind of string encoding, such as ' \
                   'UTF-8, EUC-JP, Big5, etc.'
spec.homepage = 'https://github.com/tonytonyjan/jaro_winkler'
spec.license = 'MIT'
spec.files = Dir['lib/**/*.rb', 'ext/**/*.{h,c}', 'LICENSE.txt']
spec.add_development_dependency 'bundler', '~> 1.7'
spec.add_development_dependency 'rake', '~> 12.0'
spec.add_development_dependency 'rake-compiler'
spec.add_development_dependency 'minitest'
end
jaro_winkler-1.5.4/jaro_winkler.java.gemspec 0000664 0000000 0000000 00000002237 13556236233 0021176 0 ustar 00root root 0000000 0000000 # frozen_string_literal: true
lib = File.expand_path('../lib', __FILE__)
$LOAD_PATH.unshift(lib) unless $LOAD_PATH.include?(lib)
require 'jaro_winkler/version'
Gem::Specification.new do |spec|
spec.name = 'jaro_winkler'
spec.version = JaroWinkler::VERSION
spec.authors = ['Jian Weihang']
spec.email = 'tonytonyjan@gmail.com'
spec.summary = 'An implementation of the Jaro-Winkler distance algorithm, written ' \
  'as a C extension, that supports any string encoding.'
spec.description = 'jaro_winkler is an implementation of the Jaro-Winkler ' \
  'distance algorithm. It is written as a C extension and falls back to a pure ' \
  'Ruby version on platforms other than MRI/KRI, such as JRuby or Rubinius. Both ' \
  'the C and Ruby implementations support any string encoding, such as ' \
  'UTF-8, EUC-JP, Big5, etc.'
spec.homepage = 'https://github.com/tonytonyjan/jaro_winkler'
spec.license = 'MIT'
spec.files = Dir['lib/**/*.rb', 'LICENSE.txt']
spec.add_development_dependency 'bundler', '~> 1.7'
spec.add_development_dependency 'rake', '~> 12.0'
spec.add_development_dependency 'rake-compiler'
spec.add_development_dependency 'minitest'
spec.platform = 'java'
end
jaro_winkler-1.5.4/lib/ 0000775 0000000 0000000 00000000000 13556236233 0014765 5 ustar 00root root 0000000 0000000 jaro_winkler-1.5.4/lib/jaro_winkler.rb 0000664 0000000 0000000 00000000601 13556236233 0017775 0 ustar 00root root 0000000 0000000 # frozen_string_literal: true
require 'jaro_winkler/version'
if RUBY_ENGINE == 'ruby'
begin
require 'jaro_winkler/jaro_winkler_ext'
rescue LoadError
# Fall back to the pure implementation if the extension
# can't be loaded for any reason (e.g. it was never built)
require 'jaro_winkler/jaro_winkler_pure'
end
else
require 'jaro_winkler/jaro_winkler_pure'
end
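=begin
The loader above prefers the compiled extension and falls back to the pure
Ruby file on LoadError. A minimal sketch of the same pattern, assuming a
hypothetical helper named require_with_fallback (not part of this gem) and
using the standard library's 'set' as a stand-in for the fallback:

```ruby
# Hypothetical helper for illustration only; the gem inlines this logic.
# Tries the preferred library first and returns the name of whichever loaded.
def require_with_fallback(preferred, fallback)
  require preferred
  preferred
rescue LoadError
  # The preferred library is missing or was never built.
  require fallback
  fallback
end

# 'no_such_native_extension' is an assumed, nonexistent library name.
puts require_with_fallback('no_such_native_extension', 'set') # set
```

Only LoadError is rescued, so other failures raised while loading the
preferred library still surface.
=end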
jaro_winkler-1.5.4/lib/jaro_winkler/ 0000775 0000000 0000000 00000000000 13556236233 0017453 5 ustar 00root root 0000000 0000000 jaro_winkler-1.5.4/lib/jaro_winkler/adjusting_table.rb 0000664 0000000 0000000 00000001134 13556236233 0023136 0 ustar 00root root 0000000 0000000 # frozen_string_literal: true
module JaroWinkler
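# Inner hashes deliberately have no default block, so an unregistered pair
# reads as nil instead of autovivifying into a (truthy) empty Hash.
DEFAULT_ADJ_TABLE = Hash.new { |h, k| h[k] = {} }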
[
%w[A E], %w[A I], %w[A O], %w[A U], %w[B V], %w[E I], %w[E O], %w[E U], %w[I O],
%w[I U], %w[O U], %w[I Y], %w[E Y], %w[C G], %w[E F], %w[W U], %w[W V], %w[X K],
%w[S Z], %w[X S], %w[Q C], %w[U V], %w[M N], %w[L I], %w[Q O], %w[P R], %w[I J],
%w[2 Z], %w[5 S], %w[8 B], %w[1 I], %w[1 L], %w[0 O], %w[0 Q], %w[C K], %w[G J],
['E', ' '], ['Y', ' '], ['S', ' ']
].each do |s1, s2|
DEFAULT_ADJ_TABLE[s1][s2] = DEFAULT_ADJ_TABLE[s2][s1] = true
end
end
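=begin
The adjusting table is a two-level Hash keyed by character and filled in both
directions, so a lookup does not depend on argument order. A minimal sketch of
that idea, assuming a reduced three-pair table (SIMILAR_PAIRS, SIMILAR_TABLE
and similar? are illustrative names, not part of the gem):

```ruby
# Hypothetical, reduced pair list; the gem's real table is larger.
SIMILAR_PAIRS = [%w[A E], %w[0 O], %w[S Z]].freeze

# The outer default block creates one inner Hash per first character.
# Inner hashes have no default, so unknown pairs are simply absent.
SIMILAR_TABLE = Hash.new { |h, k| h[k] = {} }
SIMILAR_PAIRS.each do |a, b|
  SIMILAR_TABLE[a][b] = SIMILAR_TABLE[b][a] = true
end

# fetch avoids creating entries for characters that were never registered.
def similar?(a, b)
  SIMILAR_TABLE.fetch(a, {}).fetch(b, false)
end

puts similar?('A', 'E') # true
puts similar?('E', 'A') # true
puts similar?('A', 'Z') # false
```

Reading through fetch keeps the table from growing during lookups, which
plain [] access on a Hash with a default block would cause.
=end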
jaro_winkler-1.5.4/lib/jaro_winkler/jaro_winkler_pure.rb 0000664 0000000 0000000 00000007226 13556236233 0023530 0 ustar 00root root 0000000 0000000 # frozen_string_literal: true
require 'jaro_winkler/adjusting_table'
module JaroWinkler
class Error < RuntimeError; end
class InvalidWeightError < Error; end
DEFAULT_WEIGHT = 0.1
DEFAULT_THRESHOLD = 0.7
DEFAULT_OPTIONS = {
jaro: { adj_table: false, ignore_case: false },
jaro_winkler: { weight: DEFAULT_WEIGHT, threshold: DEFAULT_THRESHOLD }
}.freeze
class << self
def distance(str1, str2, options = {})
validate!(str1, str2)
_distance str1.codepoints.to_a, str2.codepoints.to_a, options
end
def jaro_distance(str1, str2, options = {})
validate!(str1, str2)
_jaro_distance str1.codepoints.to_a, str2.codepoints.to_a, options
end
private
def _distance(codes1, codes2, options = {})
options = DEFAULT_OPTIONS[:jaro_winkler].merge options
raise InvalidWeightError if options[:weight] > 0.25
jaro_distance = _jaro_distance(codes1, codes2, options)
if jaro_distance < options[:threshold]
jaro_distance
else
codes1, codes2 = codes2, codes1 if codes1.length > codes2.length
# Winkler's prefix bonus looks at no more than the first four characters.
max_4 = [codes1.length, 4].min
prefix = 0
prefix += 1 while prefix < max_4 && codes1[prefix] == codes2[prefix]
jaro_distance + prefix * options[:weight] * (1 - jaro_distance)
end
end
def _jaro_distance(codes1, codes2, options = {})
options = DEFAULT_OPTIONS[:jaro].merge options
codes1, codes2 = codes2, codes1 if codes1.length > codes2.length
len1 = codes1.length
len2 = codes2.length
return 0.0 if len1 == 0 || len2 == 0
if options[:ignore_case]
# Upcase ASCII a-z only; other codepoints are left untouched.
codes1.map! { |c| c >= 97 && c <= 122 ? c - 32 : c }
codes2.map! { |c| c >= 97 && c <= 122 ? c - 32 : c }
end
window = len2 / 2 - 1
window = 0 if window < 0
flags1 = 0
flags2 = 0
# count number of matching characters
match_count = 0
i = 0
while i < len1
left = i >= window ? i - window : 0
right = i + window <= len2 - 1 ? i + window : len2 - 1
j = left
while j <= right
if flags2[j] == 0 && codes1[i] == codes2[j]
flags1 |= (1 << i)
flags2 |= (1 << j)
match_count += 1
break
end
j += 1
end
i += 1
end
return 0.0 if match_count == 0
# count number of transpositions
transposition_count = j = k = 0
i = 0
while i < len1
if flags1[i] == 1
j = k
while j < len2
if flags2[j] == 1
k = j + 1
break
end
j += 1
end
transposition_count += 1 if codes1[i] != codes2[j]
end
i += 1
end
# count similarities in nonmatched characters
similar_count = 0
if options[:adj_table] && len1 > match_count
i = 0
while i < len1
if flags1[i] == 0
j = 0
while j < len2
if flags2[j] == 0
if DEFAULT_ADJ_TABLE[codes1[i].chr(Encoding::UTF_8)][codes2[j].chr(Encoding::UTF_8)]
similar_count += 3
break
end
end
j += 1
end
end
i += 1
end
end
m = match_count.to_f
t = transposition_count / 2
m = similar_count / 10.0 + m if options[:adj_table]
(m / len1 + m / len2 + (m - t) / m) / 3
end
def validate!(str1, str2)
raise TypeError unless str1.is_a?(String) && str2.is_a?(String)
end
end
end
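=begin
The last lines of _jaro_distance and _distance reduce to two short formulas.
A worked sketch for 'martha' vs 'marhta', assuming two hypothetical helpers
(jaro_from_counts and winkler_adjust are illustrative names, not part of the
gem's API) and leaving out the threshold check and the adjusting table:

```ruby
# Hypothetical helpers for illustration only.

# Jaro score from the match count m, the halved transposition count t,
# and the two string lengths.
def jaro_from_counts(m, t, len1, len2)
  return 0.0 if m.zero?
  m = m.to_f
  (m / len1 + m / len2 + (m - t) / m) / 3
end

# Winkler's adjustment: boost the Jaro score by the shared prefix length
# (capped at 4 characters), scaled by weight.
def winkler_adjust(jaro, prefix_len, weight = 0.1)
  jaro + [prefix_len, 4].min * weight * (1 - jaro)
end

# 'martha' vs 'marhta': 6 matches, 2 characters out of order (t = 1),
# shared prefix 'mar' (length 3).
jaro = jaro_from_counts(6, 1, 6, 6)
puts jaro.round(4)                         # 0.9444
puts winkler_adjust(jaro, 3).round(4)      # 0.9611
puts winkler_adjust(jaro, 3, 0.2).round(4) # 0.9778
```

The three printed values agree with the expectations for this pair in
test/tests.rb.
=end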
jaro_winkler-1.5.4/lib/jaro_winkler/version.rb 0000664 0000000 0000000 00000000112 13556236233 0021457 0 ustar 00root root 0000000 0000000 # frozen_string_literal: true
module JaroWinkler
VERSION = '1.5.4'
end
jaro_winkler-1.5.4/test/ 0000775 0000000 0000000 00000000000 13556236233 0015176 5 ustar 00root root 0000000 0000000 jaro_winkler-1.5.4/test/test_jaro_winkler.rb 0000664 0000000 0000000 00000000230 13556236233 0021243 0 ustar 00root root 0000000 0000000 require 'minitest/autorun'
require_relative 'tests'
require 'jaro_winkler/jaro_winkler_ext'
class TestJaroWinkler < Minitest::Test
include Tests
end
jaro_winkler-1.5.4/test/test_pure_ruby.rb 0000664 0000000 0000000 00000000231 13556236233 0020572 0 ustar 00root root 0000000 0000000 require 'minitest/autorun'
require_relative 'tests'
require 'jaro_winkler/jaro_winkler_pure'
class TestJaroWinkler < Minitest::Test
include Tests
end
jaro_winkler-1.5.4/test/tests.rb 0000664 0000000 0000000 00000011725 13556236233 0016673 0 ustar 00root root 0000000 0000000 # encoding: utf-8
module Tests
def test_distance
assert_distance 0.9667, 'henka', 'henkan'
assert_distance 1.0, 'al', 'al'
assert_distance 0.9611, 'martha', 'marhta'
assert_distance 0.8324, 'jones', 'johnson'
assert_distance 0.9583, 'abcvwxyz', 'cabvwxyz'
assert_distance 0.84, 'dwayne', 'duane'
assert_distance 0.8133, 'dixon', 'dicksonx'
assert_distance 0.0, 'fvie', 'ten'
assert_distance 1.0, 'tony', 'tony'
assert_distance 1.0, 'tonytonyjan', 'tonytonyjan'
assert_distance 1.0, 'x', 'x'
assert_distance 0.0, '', ''
assert_distance 0.0, 'tony', ''
assert_distance 0.0, '', 'tony'
assert_distance 0.8727, 'tonytonyjan', 'tony'
assert_distance 0.8727, 'tony', 'tonytonyjan'
assert_distance 0.9407, 'necessary', 'nessecary'
assert_distance 0.9067, 'does_exist', 'doesnt_exist'
assert_distance 0.975, '12345678', '12345687'
assert_distance 0.975, '12345678', '12345867'
assert_distance 0.95, '12345678', '12348567'
end
def test_jaro_distance
assert_jaro_distance 0.9444, 'henka', 'henkan'
assert_jaro_distance 1.0, 'al', 'al'
assert_jaro_distance 0.9444, 'martha', 'marhta'
assert_jaro_distance 0.7905, 'jones', 'johnson'
assert_jaro_distance 0.9583, 'abcvwxyz', 'cabvwxyz'
assert_jaro_distance 0.8222, 'dwayne', 'duane'
assert_jaro_distance 0.7667, 'dixon', 'dicksonx'
assert_jaro_distance 0.0, 'fvie', 'ten'
assert_jaro_distance 1.0, 'tony', 'tony'
assert_jaro_distance 1.0, 'tonytonyjan', 'tonytonyjan'
assert_jaro_distance 1.0, 'x', 'x'
assert_jaro_distance 0.0, '', ''
assert_jaro_distance 0.0, 'tony', ''
assert_jaro_distance 0.0, '', 'tony'
assert_jaro_distance 0.7879, 'tonytonyjan', 'tony'
assert_jaro_distance 0.7879, 'tony', 'tonytonyjan'
assert_jaro_distance 0.9259, 'necessary', 'nessecary'
assert_jaro_distance 0.8444, 'does_exist', 'doesnt_exist'
assert_jaro_distance 0.9583, '12345678', '12345687'
assert_jaro_distance 0.9583, '12345678', '12345867'
assert_jaro_distance 0.9167, '12345678', '12348567'
assert_jaro_distance 0.604, 'tonytonyjan', 'janjantony'
end
def test_unicode
assert_distance 0.9818, '變形金剛4:絕跡重生', '變形金剛4: 絕跡重生'
assert_distance 0.8222, '連勝文', '連勝丼'
assert_distance 0.8222, '馬英九', '馬英丸'
assert_distance 0.6667, '良い', 'いい'
end
def test_ignore_case
assert_distance 0.9611, 'MARTHA', 'marhta', ignore_case: true
end
def test_weight
assert_distance 0.9778, 'MARTHA', 'MARHTA', weight: 0.2
end
def test_threshold
assert_distance 0.9444, 'MARTHA', 'MARHTA', threshold: 0.99
end
def test_adjusting_table
assert_distance 0.9667, 'HENKA', 'HENKAN', adj_table: true
assert_distance 1.0, 'AL', 'AL', adj_table: true
assert_distance 0.9611, 'MARTHA', 'MARHTA', adj_table: true
assert_distance 0.8598, 'JONES', 'JOHNSON', adj_table: true
assert_distance 0.9583, 'ABCVWXYZ', 'CABVWXYZ', adj_table: true
assert_distance 0.8730, 'DWAYNE', 'DUANE', adj_table: true
assert_distance 0.8393, 'DIXON', 'DICKSONX', adj_table: true
assert_distance 0.0, 'FVIE', 'TEN', adj_table: true
end
def test_error
assert_raises JaroWinkler::InvalidWeightError do
JaroWinkler.distance 'MARTHA', 'MARHTA', weight: 0.26
end
end
def test_long_string
# Identical inputs longer than 64 characters must still score 1.0.
assert_distance 1.0, 'haisai' * 20, 'haisai' * 20
end
def test_encoding
assert_encoding '焦玟綾', '焦紋綾', Encoding::Big5
assert_encoding '簡煒航', '簡偉航', Encoding::Big5_HKSCS
assert_encoding '西島之', '西鳥志', Encoding::EUCJP
assert_encoding '松本行弘', '枩本行弘', Encoding::Shift_JIS
assert_distance 1.0, "\xe8".force_encoding('iso8859-1'), 'è'
end
def test_raises_type_error
assert_raises(TypeError){ JaroWinkler.distance 'MARTHA', nil }
assert_raises(TypeError){ JaroWinkler.distance nil, 'MARTHA' }
assert_raises(TypeError){ JaroWinkler.distance nil, nil }
assert_raises(TypeError){ JaroWinkler.distance 'MARTHA', :non_string }
assert_raises(TypeError){ JaroWinkler.distance :non_string, 'MARTHA' }
assert_raises(TypeError){ JaroWinkler.distance :non_string, :non_string }
end
private
def assert_distance score, str1, str2, options={}
assert_in_delta score, JaroWinkler.distance(str1, str2, options)
end
def assert_encoding str1, str2, encoding, options={}
assert_distance JaroWinkler.distance(str1, str2, options), str1.encode(encoding), str2.encode(encoding), options
end
def assert_jaro_distance score, str1, str2, options={}
assert_in_delta score, JaroWinkler.jaro_distance(str1, str2, options)
end
end
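=begin
The helpers above compare scores with assert_in_delta, whose default delta in
Minitest is 0.001. That is why expectations rounded to four decimals pass. A
minimal sketch of the tolerance check, assuming a hypothetical stand-in named
within_delta? (not a Minitest method):

```ruby
# Hypothetical stand-in for the tolerance check; the default delta mirrors
# Minitest's default for assert_in_delta.
def within_delta?(expected, actual, delta = 0.001)
  (expected - actual).abs <= delta
end

exact = 17.0 / 18 # Jaro score for 'martha' vs 'marhta'
puts within_delta?(0.9444, exact) # true
puts within_delta?(0.94, exact)   # false
```

A tighter check can be made by passing a smaller delta as the third argument.
=end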