Size ratio:
N | BT_A1 | BT_A2 | CL_A1 | CL_A2 | DR_A1 | DR_A2 | HS_A1 | HS_A2 | RN_A1 | RN_A2 |
BT_A1 | 1.000000 | 1.072581 | 1.002055 | 1.071010 | 1.011057 | 1.082101 | 1.368569 | 1.071010 | 1.388046 | 1.425926 |
BT_A2 | 0.932331 | 1.000000 | 0.934247 | 0.998536 | 0.942640 | 1.008876 | 1.275959 | 0.998536 | 1.294118 | 1.329435 |
CL_A1 | 0.997949 | 1.070381 | 1.000000 | 1.068814 | 1.008984 | 1.079882 | 1.365762 | 1.068814 | 1.385199 | 1.423002 |
CL_A2 | 0.933698 | 1.001466 | 0.935616 | 1.000000 | 0.944022 | 1.010355 | 1.277830 | 1.000000 | 1.296015 | 1.331384 |
DR_A1 | 0.989064 | 1.060850 | 0.991096 | 1.059297 | 1.000000 | 1.070266 | 1.353601 | 1.059297 | 1.372865 | 1.410331 |
DR_A2 | 0.924129 | 0.991202 | 0.926027 | 0.989751 | 0.934347 | 1.000000 | 1.264733 | 0.989751 | 1.282732 | 1.317739 |
HS_A1 | 0.730690 | 0.783724 | 0.732192 | 0.782577 | 0.738770 | 0.790680 | 1.000000 | 0.782577 | 1.014231 | 1.041910 |
HS_A2 | 0.933698 | 1.001466 | 0.935616 | 1.000000 | 0.944022 | 1.010355 | 1.277830 | 1.000000 | 1.296015 | 1.331384 |
RN_A1 | 0.720437 | 0.772727 | 0.721918 | 0.771596 | 0.728404 | 0.779586 | 0.985968 | 0.771596 | 1.000000 | 1.027290 |
RN_A2 | 0.701299 | 0.752199 | 0.702740 | 0.751098 | 0.709053 | 0.758876 | 0.959775 | 0.751098 | 0.973435 | 1.000000 |
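The table above is consistent with each cell being the length of the row sequence divided by the length of the column sequence (ones on the diagonal, reciprocal values across it). A minimal sketch under that assumption follows; the function name and the toy sequences are hypothetical and are not the collagen data.

```python
def size_ratio_matrix(seqs):
    """Return {row: {col: len(row_seq) / len(col_seq)}} for every pair of sequences."""
    names = list(seqs)
    return {r: {c: len(seqs[r]) / len(seqs[c]) for c in names} for r in names}


# Toy sequences (hypothetical, for illustration only)
toy = {"A": "GPPGPP", "B": "GPP"}
ratios = size_ratio_matrix(toy)

for row in toy:
    print(row, " | ".join(f"{ratios[row][col]:.6f}" for col in toy))
```

A value above 1 means the row sequence is longer than the column sequence.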
Similar text [as %]:
This calculates the similarity between two strings as described in Oliver [1993].
N | BT_A1 | BT_A2 | CL_A1 | CL_A2 | DR_A1 | DR_A2 | HS_A1 | HS_A2 | RN_A1 | RN_A2 |
BT_A1 | 100.00 | 47.40 | 97.37 | 50.41 | 75.46 | 48.53 | 82.39 | 52.60 | 72.79 | 7.96 |
BT_A2 | 47.19 | 100.00 | 47.38 | 94.51 | 52.29 | 62.67 | 37.81 | 93.41 | 10.34 | 70.63 |
CL_A1 | 97.43 | 47.38 | 100.00 | 49.40 | 75.95 | 48.36 | 82.56 | 52.02 | 72.32 | 8.13 |
CL_A2 | 48.71 | 94.51 | 50.18 | 100.00 | 48.28 | 61.74 | 35.81 | 94.36 | 10.41 | 67.89 |
DR_A1 | 75.74 | 49.38 | 76.37 | 47.00 | 100.00 | 38.94 | 63.99 | 48.28 | 52.06 | 26.45 |
DR_A2 | 42.77 | 60.16 | 43.46 | 58.13 | 36.73 | 100.00 | 40.56 | 60.41 | 15.88 | 46.93 |
HS_A1 | 82.39 | 35.35 | 82.56 | 34.58 | 62.56 | 38.08 | 100.00 | 38.19 | 74.05 | 9.07 |
HS_A2 | 53.23 | 93.33 | 52.09 | 94.51 | 49.41 | 66.15 | 40.49 | 100.00 | 9.92 | 66.81 |
RN_A1 | 72.79 | 20.93 | 72.32 | 20.58 | 44.22 | 22.11 | 74.05 | 9.34 | 100.00 | 51.35 |
RN_A2 | 18.64 | 70.63 | 14.00 | 66.56 | 25.80 | 40.62 | 15.18 | 66.97 | 38.94 | 100.00 |
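A minimal Python sketch of this kind of similarity, assuming it works like the Delphi routine listed at the end of this page: find the longest common run, recurse on the text to its left and to its right, and report twice the matched characters over the combined length as a percentage. The function names are hypothetical, and the sketch is cubic in the worst case, so it is meant for short strings only.

```python
def longest_common_run(a, b):
    """Return (start_a, start_b, length) of the first longest common substring."""
    best = (0, 0, 0)
    for i in range(len(a)):
        for j in range(len(b)):
            k = 0
            while i + k < len(a) and j + k < len(b) and a[i + k] == b[j + k]:
                k += 1
            if k > best[2]:  # strict '>' keeps the first run found on ties
                best = (i, j, k)
    return best


def similar_chars(a, b):
    """Count matched characters: longest common run plus matches on both sides of it."""
    i, j, k = longest_common_run(a, b)
    if k == 0:
        return 0
    return k + similar_chars(a[:i], b[:j]) + similar_chars(a[i + k:], b[j + k:])


def similar_percent(a, b):
    """Similarity as a percentage: 2 * matched / (len(a) + len(b)) * 100."""
    if not a and not b:
        return 0.0
    return 100.0 * 2 * similar_chars(a, b) / (len(a) + len(b))


print(round(similar_percent("World", "Word"), 2))  # 88.89
```

Ties between equally long runs are broken by scan order, so swapping the two arguments can change the result. That may be one reason the table above is not symmetric.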
The Levenshtein distance is defined as the minimal number of characters you have to replace, insert or delete to transform str1 into str2.
int levenshtein ( string str1, string str2 [, int cost_ins, int cost_rep, int cost_del])
A second variant takes three additional parameters that define the cost of the insert, replace and delete operations. It is more general and adaptive than the first variant, but less efficient.
The function is limited to strings of at most 255 characters, so it is extended here to cover longer sequences in windows of 255 characters.
Max chars: 1463
Levenshtein from 0 to 255
Levenshtein from 255 to 510
Levenshtein from 510 to 765
Levenshtein from 765 to 1020
Levenshtein from 1020 to 1275
Levenshtein from 1275 to 1530
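The progress lines above suggest that the sequences are cut into consecutive 255-character windows and compared window by window. The sketch below assumes the per-window distances are simply summed; that is an assumption, not something the page states. The names `levenshtein` and `windowed_levenshtein` are hypothetical, and the cost arguments follow the order of the signature above (insert, replace, delete).

```python
def levenshtein(s, t, cost_ins=1, cost_rep=1, cost_del=1):
    """Weighted edit distance from s to t, keeping only two rows of the table."""
    prev = [j * cost_ins for j in range(len(t) + 1)]
    for i, cs in enumerate(s, start=1):
        curr = [i * cost_del]
        for j, ct in enumerate(t, start=1):
            curr.append(min(
                prev[j] + cost_del,                           # delete cs
                curr[j - 1] + cost_ins,                       # insert ct
                prev[j - 1] + (0 if cs == ct else cost_rep),  # keep or replace
            ))
        prev = curr
    return prev[-1]


def windowed_levenshtein(s, t, window=255, cost_ins=1, cost_rep=1, cost_del=1):
    """Sum of distances over aligned windows (assumed extension past the limit)."""
    total = 0
    for start in range(0, max(len(s), len(t)), window):
        total += levenshtein(s[start:start + window], t[start:start + window],
                             cost_ins, cost_rep, cost_del)
    return total


print(levenshtein("kitten", "sitting"))                  # 3
print(levenshtein("kitten", "sitting", 1, 10, 1))        # 5
print(windowed_levenshtein("abcdef", "bcdefg", window=3))  # 4
```

Summing over windows can only overestimate the true distance, because a shift of one character breaks the alignment in every later window: "abcdef" and "bcdefg" are 2 edits apart, but the windowed sum with a window of 3 is 4. With a replace cost of 10 and insert and delete costs of 1, a delete followed by an insert (cost 2) is always cheaper than a replacement, so replacements are never chosen.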
Levenshtein [1,1,1]:
N | BT_A1 | BT_A2 | CL_A1 | CL_A2 | DR_A1 | DR_A2 | HS_A1 | HS_A2 | RN_A1 | RN_A2 |
BT_A1 | 0 | 903 | 70 | 893 | 469 | 909 | 429 | 884 | 1031 | 1061 |
BT_A2 | 903 | 0 | 892 | 98 | 902 | 463 | 882 | 115 | 921 | 941 |
CL_A1 | 70 | 892 | 0 | 891 | 440 | 903 | 448 | 883 | 1033 | 1059 |
CL_A2 | 893 | 98 | 891 | 0 | 896 | 492 | 883 | 76 | 921 | 944 |
DR_A1 | 469 | 902 | 440 | 896 | 0 | 909 | 731 | 895 | 1016 | 1044 |
DR_A2 | 909 | 463 | 903 | 492 | 909 | 0 | 871 | 487 | 890 | 923 |
HS_A1 | 429 | 882 | 448 | 883 | 731 | 871 | 0 | 874 | 650 | 671 |
HS_A2 | 884 | 115 | 883 | 76 | 895 | 487 | 874 | 0 | 923 | 940 |
RN_A1 | 1031 | 921 | 1033 | 921 | 1016 | 890 | 650 | 923 | 0 | 428 |
RN_A2 | 1061 | 941 | 1059 | 944 | 1044 | 923 | 671 | 940 | 428 | 0 |
Levenshtein [1,10,1]:
N | BT_A1 | BT_A2 | CL_A1 | CL_A2 | DR_A1 | DR_A2 | HS_A1 | HS_A2 | RN_A1 | RN_A2 |
BT_A1 | 0 | 1337 | 103 | 1325 | 726 | 1345 | 454 | 1327 | 1411 | 1443 |
BT_A2 | 1337 | 0 | 1320 | 170 | 1337 | 762 | 1215 | 200 | 1304 | 1190 |
CL_A1 | 103 | 1320 | 0 | 1312 | 703 | 1340 | 471 | 1316 | 1404 | 1440 |
CL_A2 | 1325 | 170 | 1312 | 0 | 1345 | 798 | 1213 | 150 | 1302 | 1248 |
DR_A1 | 726 | 1337 | 703 | 1345 | 0 | 1347 | 924 | 1337 | 1407 | 1451 |
DR_A2 | 1345 | 762 | 1340 | 798 | 1347 | 0 | 1207 | 790 | 1268 | 1298 |
HS_A1 | 454 | 1215 | 471 | 1213 | 924 | 1207 | 0 | 1205 | 1039 | 1053 |
HS_A2 | 1327 | 200 | 1316 | 150 | 1337 | 790 | 1205 | 0 | 1302 | 1250 |
RN_A1 | 1411 | 1304 | 1404 | 1302 | 1407 | 1268 | 1039 | 1302 | 0 | 710 |
RN_A2 | 1443 | 1190 | 1440 | 1248 | 1451 | 1298 | 1053 | 1250 | 710 | 0 |
Data source:
Name | Reference |
BT_A1 | >ipi|NP_001029211|NP_001029211 COLLAGEN ALPHA-1(I) Bos Taurus |
BT_A2 | >ipi|NP_776945|NP_776945.1 COLLAGEN ALPHA-2(I) Bos Taurus |
CL_A1 | >ipi|NP_001003090|NP_001003090.1 COLLAGEN ALPHA-1(I) Canis lupus familiaris |
CL_A2 | >ipi|NP_001003187|NP_001003187.1 COLLAGEN ALPHA-2(I) Canis lupus familiaris |
DR_A1 | >ipi|NP_954684|NP_954684.1 COLLAGEN ALPHA-1(I) Danio rerio |
DR_A2 | >ipi|CAK05064|CAK05064.1 COLLAGEN ALPHA-2(I) Danio rerio |
HS_A1 | >ipi|CAA67261|CAA67261.1 COLLAGEN ALPHA-1(I) Homo sapiens |
HS_A2 | >ipi|AAH42586|AAH42586.1 COLLAGEN ALPHA-2(I) Homo sapiens |
RN_A1 | >ipi|IPI00188909|IPI00188909.2 COLLAGEN ALPHA-1(I) CHAIN PRECURSOR. |
RN_A2 | >ipi|IPI00188921|IPI00188921.1 COLLAGEN ALPHA-2(I) CHAIN PRECURSOR. |
References:
Levenshtein distance
In information theory and computer science, the Levenshtein distance is a string metric which is one way to measure edit distance. The Levenshtein distance between two strings is given by the minimum number of operations needed to transform one string into the other, where an operation is an insertion, deletion, or substitution of a single character. It is named after Vladimir Levenshtein, who considered this distance in 1965 [V. I. Levenshtein, Binary codes capable of correcting deletions, insertions, and reversals. Soviet Physics Doklady 10 (1966):707–710]. It is useful in applications that need to determine how similar two strings are.
This algorithm is based on the Wagner-Fischer algorithm for edit distance. Here is pseudocode for a function LevenshteinDistance that takes two strings, s of length m, and t of length n, and computes the Levenshtein distance between them:
int LevenshteinDistance(char s[1..m], char t[1..n])
    // d is a table with m+1 rows and n+1 columns
    declare int d[0..m, 0..n]
    for i from 0 to m
        d[i, 0] := i
    for j from 1 to n
        d[0, j] := j
    for i from 1 to m
        for j from 1 to n
            if s[i] = t[j] then cost := 0
            else cost := 1
            d[i, j] := minimum(
                d[i-1, j] + 1,       // deletion
                d[i, j-1] + 1,       // insertion
                d[i-1, j-1] + cost   // substitution
            )
    return d[m, n]
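For readers who want to run the pseudocode, here is a direct Python translation as a minimal sketch. It keeps the full table as the pseudocode does; the only change is that Python strings are indexed from 0, so s[i] and t[j] become s[i - 1] and t[j - 1].

```python
def levenshtein_distance(s: str, t: str) -> int:
    """Edit distance between s and t using the full (m+1) x (n+1) table."""
    m, n = len(s), len(t)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i  # delete the first i characters of s
    for j in range(1, n + 1):
        d[0][j] = j  # insert the first j characters of t
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution
            )
    return d[m][n]


print(levenshtein_distance("kitten", "sitting"))  # 3
```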
http://www.csse.monash.edu.au/~dld/Publications/1993/Dowe+Oliver+Dix+Allison+Wallace1993_Decision_Graph_Explanation.html
Oliver, J.J., Dowe, D.L., Wallace, C.S. (1992) Inferring decision graphs using the minimum message length principle. Proceedings of the 5th Australian Joint Conference on Artificial Intelligence, pp. 361-367. Eds. A. Adams and L. Sterling, Singapore: World Scientific.
Oliver, J.J., Hand, D.J. (1994) Introduction to minimum encoding inference. Technical Report TR 94-4, Department of Statistics, Open University, Walton Hall, Milton Keynes, MK7 6AA, UK. Also available as TR 205, Department of Computer Science, Monash University, Clayton, Victoria 3168, Australia, http://www.cs.monash.edu.au/~jono
Oliver, J.J., Baxter, R.A. (1994) MML and Bayesianism: Similarities and Differences. Technical Report TR 206, Department of Computer Science, Monash University, Clayton, Victoria 3168, Australia, http://www.cs.monash.edu.au/~jono
function TfMain.Fuzzy_Simil(padrao, texto: string; tpadrao, ttexto: integer): variant;
var
  max, l, pos1, pos2, sum, end1, end2, cont1, cont2: integer;
  p, q: string;
begin
  // max holds the length of the longest common run found so far, plus one
  max := 1; end1 := ttexto + 1; end2 := tpadrao + 1; pos1 := 1;
  pos2 := 1; p := texto; cont1 := 1;
  while cont1 < end1 do begin
    q := padrao;
    cont2 := 1;
    while cont2 < end2 do begin
      l := 1;
      // extend the match that starts at texto[cont1] and padrao[cont2]
      while (((cont1 + l) - 1) < end1) and (((cont2 + l) - 1) < end2)
        and (p[(cont1 + l) - 1] = q[(cont2 + l) - 1]) do inc(l);
      if (l > max) then
      begin
        max := l;
        pos1 := (length(texto) - (length(texto) - (cont1 - 1))) + 1;
        pos2 := (length(padrao) - (length(padrao) - (cont2 - 1))) + 1;
      end;
      inc(cont2);
    end;
    inc(cont1);
  end;
  if (max = 1) then Fuzzy_Simil := 0
  else begin
    sum := max;
    // recurse on the parts to the left of the common run
    if (pos1 > 1) and (pos2 > 1)
      then sum := sum + Fuzzy_Simil(padrao, texto, pos2 - 1, pos1 - 1);
    // recurse on the parts to the right of the common run
    if (((pos1 + max - 1) < tpadrao) and ((pos2 + max - 1) < ttexto)) then
    begin
      sum := sum + Fuzzy_Simil(copy(padrao, pos2 + max - 1, length(padrao)), copy(texto, pos1 + max - 1, length(texto)), tpadrao - pos2 - max, ttexto - pos1 - max);
    end;
    Fuzzy_Simil := sum - 1;
  end;
end;

function TfMain.Similar(padrao, texto: string): variant;
var
  resultado: integer;
  total: variant;
begin
  resultado := Fuzzy_Simil(padrao, texto, length(padrao), length(texto));
  // similarity as a percentage: twice the matched characters over the combined length
  total := (100 * (resultado * 2 / (length(padrao) + length(texto))));
  Similar := total;
end;
http://rumkin.com/reference/algorithms/fuzzy_strings/