© August 2007 Lorentz JÄNTSCHI && Sorana Daniela BOLBOACĂ


Size ratio:
N      BT_A1     BT_A2     CL_A1     CL_A2     DR_A1     DR_A2     HS_A1     HS_A2     RN_A1     RN_A2
BT_A1  1.000000  1.072581  1.002055  1.071010  1.011057  1.082101  1.368569  1.071010  1.388046  1.425926
BT_A2  0.932331  1.000000  0.934247  0.998536  0.942640  1.008876  1.275959  0.998536  1.294118  1.329435
CL_A1  0.997949  1.070381  1.000000  1.068814  1.008984  1.079882  1.365762  1.068814  1.385199  1.423002
CL_A2  0.933698  1.001466  0.935616  1.000000  0.944022  1.010355  1.277830  1.000000  1.296015  1.331384
DR_A1  0.989064  1.060850  0.991096  1.059297  1.000000  1.070266  1.353601  1.059297  1.372865  1.410331
DR_A2  0.924129  0.991202  0.926027  0.989751  0.934347  1.000000  1.264733  0.989751  1.282732  1.317739
HS_A1  0.730690  0.783724  0.732192  0.782577  0.738770  0.790680  1.000000  0.782577  1.014231  1.041910
HS_A2  0.933698  1.001466  0.935616  1.000000  0.944022  1.010355  1.277830  1.000000  1.296015  1.331384
RN_A1  0.720437  0.772727  0.721918  0.771596  0.728404  0.779586  0.985968  0.771596  1.000000  1.027290
RN_A2  0.701299  0.752199  0.702740  0.751098  0.709053  0.758876  0.959775  0.751098  0.973435  1.000000
Similar text [as %]:
This calculates the similarity between two strings as described in Oliver [1993].
N       BT_A1   BT_A2   CL_A1   CL_A2   DR_A1   DR_A2   HS_A1   HS_A2   RN_A1   RN_A2
BT_A1  100.00   47.40   97.37   50.41   75.46   48.53   82.39   52.60   72.79    7.96
BT_A2   47.19  100.00   47.38   94.51   52.29   62.67   37.81   93.41   10.34   70.63
CL_A1   97.43   47.38  100.00   49.40   75.95   48.36   82.56   52.02   72.32    8.13
CL_A2   48.71   94.51   50.18  100.00   48.28   61.74   35.81   94.36   10.41   67.89
DR_A1   75.74   49.38   76.37   47.00  100.00   38.94   63.99   48.28   52.06   26.45
DR_A2   42.77   60.16   43.46   58.13   36.73  100.00   40.56   60.41   15.88   46.93
HS_A1   82.39   35.35   82.56   34.58   62.56   38.08  100.00   38.19   74.05    9.07
HS_A2   53.23   93.33   52.09   94.51   49.41   66.15   40.49  100.00    9.92   66.81
RN_A1   72.79   20.93   72.32   20.58   44.22   22.11   74.05    9.34  100.00   51.35
RN_A2   18.64   70.63   14.00   66.56   25.80   40.62   15.18   66.97   38.94  100.00
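The similarity measure can be sketched in a few lines of Python. This is a minimal sketch, assuming the procedure shown in the Delphi listing in the references: find the first longest common substring, recurse on the text to its left and to its right, and report twice the number of matched characters as a percentage of the combined length. The names similar_chars and similar_percent are hypothetical and chosen only for this illustration.

```python
def similar_chars(s1, s2):
    """Count matched characters: longest common substring, then recurse left and right."""
    best, pos1, pos2 = 0, 0, 0
    for i in range(len(s1)):
        for j in range(len(s2)):
            k = 0
            while i + k < len(s1) and j + k < len(s2) and s1[i + k] == s2[j + k]:
                k += 1
            # Strict comparison: the first longest match found is the one kept.
            if k > best:
                best, pos1, pos2 = k, i, j
    if best == 0:
        return 0
    return (best
            + similar_chars(s1[:pos1], s2[:pos2])
            + similar_chars(s1[pos1 + best:], s2[pos2 + best:]))


def similar_percent(s1, s2):
    """Similarity as a percentage of the combined length of both strings."""
    if not s1 and not s2:
        return 0.0
    return 100.0 * 2 * similar_chars(s1, s2) / (len(s1) + len(s2))


print(similar_chars("World", "Word"), similar_percent("World", "Word"))
print(similar_chars("test", "wert"), similar_chars("wert", "test"))
```

Because the first longest match wins, the result can depend on the order of the arguments ("test" against "wert" matches one character, "wert" against "test" matches two). This may be why the table above is not symmetric.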
The Levenshtein distance is defined as the minimal number of characters you have to replace, insert or delete to transform str1 into str2.
int levenshtein ( string str1, string str2 [, int cost_ins, int cost_rep, int cost_del])
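Called with only the two strings, every insert, replace and delete operation costs 1. A second variant takes three additional parameters that define the cost of the insert, replace and delete operations. It is more general and adaptive than the first variant, but not as efficient.
PHP's levenshtein() accepts strings of at most 255 characters. The sequences compared here are up to 1463 characters long, so the calculation is extended to run over successive 255-character ranges, as the progress lines below show.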
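The effect of the three cost parameters can be illustrated with a short Python sketch. It is not PHP's implementation; the function name weighted_levenshtein is hypothetical, and only the parameter order (insert, replace, delete) is taken from the signature above.

```python
def weighted_levenshtein(s, t, cost_ins=1, cost_rep=1, cost_del=1):
    """Minimum total cost of turning s into t with weighted edit operations."""
    # prev[j] is the cost of turning the already processed prefix of s
    # into the first j characters of t. Start with the empty prefix.
    prev = [j * cost_ins for j in range(len(t) + 1)]
    for i, sc in enumerate(s, start=1):
        curr = [i * cost_del]  # delete the first i characters of s
        for j, tc in enumerate(t, start=1):
            curr.append(min(
                prev[j] + cost_del,                            # delete sc
                curr[j - 1] + cost_ins,                        # insert tc
                prev[j - 1] + (0 if sc == tc else cost_rep),   # keep or replace
            ))
        prev = curr
    return prev[-1]


print(weighted_levenshtein("kitten", "sitting"))            # costs [1,1,1]
print(weighted_levenshtein("kitten", "sitting", 1, 10, 1))  # costs [1,10,1]
```

With costs [1,10,1] a replacement (cost 10) is never cheaper than a delete followed by an insert (cost 2), so the result counts only insertions and deletions.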
Max chars: 1463
Levenshtein from 0 to 255
Levenshtein from 255 to 510
Levenshtein from 510 to 765
Levenshtein from 765 to 1020
Levenshtein from 1020 to 1275
Levenshtein from 1275 to 1530
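The page does not show how the 255-character ranges are combined. One possible reading, assumed here and not confirmed by the source, is that the distance is computed range by range and the partial results are added. The Python sketch below illustrates that assumption; the names levenshtein and chunked_levenshtein are hypothetical.

```python
def levenshtein(s, t):
    """Plain Levenshtein distance with unit costs, keeping one table row at a time."""
    prev = list(range(len(t) + 1))
    for i, sc in enumerate(s, start=1):
        curr = [i]
        for j, tc in enumerate(t, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (sc != tc)))  # substitution
        prev = curr
    return prev[-1]


def chunked_levenshtein(s, t, chunk=255):
    """Sum of the distances between aligned ranges of at most `chunk` characters."""
    total = 0
    for start in range(0, max(len(s), len(t)), chunk):
        total += levenshtein(s[start:start + chunk], t[start:start + chunk])
    return total


a = "x" + "ab" * 300
b = "ab" * 300
print(levenshtein(a, b), chunked_levenshtein(a, b))
```

Under this assumption the sum is an upper bound on the true distance: editing every range separately always transforms one string into the other, but an insertion early in a sequence shifts all later ranges out of alignment and inflates the total.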
Levenshtein [1,1,1]:
N      BT_A1  BT_A2  CL_A1  CL_A2  DR_A1  DR_A2  HS_A1  HS_A2  RN_A1  RN_A2
BT_A1      0    903     70    893    469    909    429    884   1031   1061
BT_A2    903      0    892     98    902    463    882    115    921    941
CL_A1     70    892      0    891    440    903    448    883   1033   1059
CL_A2    893     98    891      0    896    492    883     76    921    944
DR_A1    469    902    440    896      0    909    731    895   1016   1044
DR_A2    909    463    903    492    909      0    871    487    890    923
HS_A1    429    882    448    883    731    871      0    874    650    671
HS_A2    884    115    883     76    895    487    874      0    923    940
RN_A1   1031    921   1033    921   1016    890    650    923      0    428
RN_A2   1061    941   1059    944   1044    923    671    940    428      0
Levenshtein [1,10,1]:
N      BT_A1  BT_A2  CL_A1  CL_A2  DR_A1  DR_A2  HS_A1  HS_A2  RN_A1  RN_A2
BT_A1      0   1337    103   1325    726   1345    454   1327   1411   1443
BT_A2   1337      0   1320    170   1337    762   1215    200   1304   1190
CL_A1    103   1320      0   1312    703   1340    471   1316   1404   1440
CL_A2   1325    170   1312      0   1345    798   1213    150   1302   1248
DR_A1    726   1337    703   1345      0   1347    924   1337   1407   1451
DR_A2   1345    762   1340    798   1347      0   1207    790   1268   1298
HS_A1    454   1215    471   1213    924   1207      0   1205   1039   1053
HS_A2   1327    200   1316    150   1337    790   1205      0   1302   1250
RN_A1   1411   1304   1404   1302   1407   1268   1039   1302      0    710
RN_A2   1443   1190   1440   1248   1451   1298   1053   1250    710      0
Data source:
Name   Reference
BT_A1  >ipi|NP_001029211|NP_001029211 COLLAGEN ALPHA-1(I) Bos Taurus
BT_A2  >ipi|NP_776945|NP_776945.1 COLLAGEN ALPHA-2(I) Bos Taurus
CL_A1  >ipi|NP_001003090|NP_001003090.1 COLLAGEN ALPHA-1(I) Canis lupus familiaris
CL_A2  >ipi|NP_001003187|NP_001003187.1 COLLAGEN ALPHA-2(I) Canis lupus familiaris
DR_A1  >ipi|NP_954684|NP_954684.1 COLLAGEN ALPHA-1(I) Danio rerio
DR_A2  >ipi|CAK05064|CAK05064.1 COLLAGEN ALPHA-2(I) Danio rerio
HS_A1  >ipi|CAA67261|CAA67261.1 COLLAGEN ALPHA-1(I) Homo sapiens
HS_A2  >ipi|AAH42586|AAH42586.1 COLLAGEN ALPHA-2(I) Homo sapiens
RN_A1  >ipi|IPI00188909|IPI00188909.2 COLLAGEN ALPHA-1(I) CHAIN PRECURSOR.
RN_A2  >ipi|IPI00188921|IPI00188921.1 COLLAGEN ALPHA-2(I) CHAIN PRECURSOR.
References:

Levenshtein distance

In information theory and computer science, the Levenshtein distance is a string metric which is one way to measure edit distance. The Levenshtein distance between two strings is given by the minimum number of operations needed to transform one string into the other, where an operation is an insertion, deletion, or substitution of a single character. It is named after Vladimir Levenshtein, who considered this distance in 1965 [V. I. Levenshtein, Binary codes capable of correcting deletions, insertions, and reversals. Soviet Physics Doklady 10 (1966): 707-710]. It is useful in applications that need to determine how similar two strings are.
The computation below is based on the Wagner-Fischer algorithm for edit distance. Here is pseudocode for a function LevenshteinDistance that takes two strings, s of length m and t of length n, and computes the Levenshtein distance between them:
int LevenshteinDistance(char s[1..m], char t[1..n])
   // d is a table with m+1 rows and n+1 columns
   declare int d[0..m, 0..n]
 
   for i from 0 to m
       d[i, 0] := i
   for j from 1 to n
       d[0, j] := j
 
   for i from 1 to m
       for j from 1 to n
           if s[i] = t[j] then cost := 0
                          else cost := 1
           d[i, j] := minimum(
                                d[i-1, j] + 1,     // deletion
                                d[i, j-1] + 1,     // insertion
                                d[i-1, j-1] + cost   // substitution
                            )
 
   return d[m, n]
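The pseudocode above translates almost line for line into Python. The sketch below keeps the full table d, as the pseudocode does, and differs only in using 0-based string indices; the function name levenshtein_distance is chosen for this illustration.

```python
def levenshtein_distance(s, t):
    """Levenshtein distance between s and t, following the pseudocode above."""
    m, n = len(s), len(t)
    # d is a table with m+1 rows and n+1 columns.
    d = [[0] * (n + 1) for _ in range(m + 1)]

    for i in range(m + 1):
        d[i][0] = i          # delete i characters of s
    for j in range(1, n + 1):
        d[0][j] = j          # insert j characters of t

    for i in range(1, m + 1):
        for j in range(1, n + 1):
            # Python strings are 0-based, so s[i] in the pseudocode is s[i - 1] here.
            cost = 0 if s[i - 1] == t[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution
            )
    return d[m][n]


print(levenshtein_distance("kitten", "sitting"))
```

For the sequences on this page (up to 1463 characters) the full table holds roughly two million cells. Keeping only the previous row would reduce memory use without changing the result.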
http://www.csse.monash.edu.au/~dld/Publications/1993/Dowe+Oliver+Dix+Allison+Wallace1993_Decision_Graph_Explanation.html
Oliver, J.J., Dowe, D.L., Wallace, C.S., Inferring decision graphs using the minimum message length principle (1992) Proceedings of the 5th Australian Joint Conference on Artificial Intelligence, pp. 361-367. Eds. A. Adams and L. Sterling, Singapore: World Scientific.
Oliver, J.J., Hand, D.J., Introduction to minimum encoding inference (1994) Technical Report TR 94-4, Department of Statistics, Open University, Walton Hall, Milton Keynes, MK7 6AA, UK. Also available as TR 205, Department of Computer Science, Monash University, Clayton, Victoria 3168, Australia, http://www.cs.monash.edu.au/~jono
Oliver, J.J., Baxter, R.A., MML and Bayesianism: Similarities and Differences (1994) Technical report TR 206, Dept. of Computer Science, Monash University, Clayton, Victoria 3168, Australia, http://www.cs.monash.edu.au/~jono
function TfMain.Fuzzy_Simil(padrao, texto: string; tpadrao,ttexto:integer):variant;
var max, l, pos1, pos2, sum, end1, end2, cont1, cont2 : integer;
    p, q : string;
begin
   max := 1;    end1 :=  ttexto + 1;   end2 :=  tpadrao + 1;   pos1:=1;
   pos2:=1;   p := texto;   cont1:=1;
   while cont1 < end1 do begin
         q:=padrao;
         cont2:=1;
             while cont2 < end2 do begin
               l:=1;
                    while (((cont1 + l) - 1) < end1) and (((cont2 + l)-1) < end2)
                     and (p[(cont1+l)-1] = q[(cont2+l)-1]) do inc(l);
               if (l > max) then
                    begin
                      max := l;
                      pos1 := (length(texto) - (length(texto)-(cont1-1)))+1;
                      pos2 := (length(padrao) - (length(padrao)-(cont2-1)))+1;
                    end;
               inc(cont2);
             end;
         inc(cont1);
   end;
   if (max = 1) then Fuzzy_Simil:=0
      else begin
             sum := max;
             if (pos1>1) and (pos2>1)
                 then sum := sum + Fuzzy_Simil(padrao, texto, pos2-1, pos1-1);
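             // pos1 indexes texto and pos2 indexes padrao; max is the match length plus one,
             // so pos + max - 1 is the first character after the match
             if (((pos1 + max - 1) <= ttexto) and ((pos2 + max - 1) <= tpadrao)) then
                begin
                  sum := sum + Fuzzy_Simil(copy(padrao,pos2 + max - 1,length(padrao)), copy(texto,pos1 + max - 1,length(texto)), tpadrao - pos2 - max + 2, ttexto - pos1 - max + 2);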
                end;
             Fuzzy_Simil:=sum-1;
           end;
end;


function TfMain.Similar(padrao, texto:string):variant;
var resultado:integer;
    total:variant;
begin
  resultado:=Fuzzy_Simil(padrao, texto, length(padrao), length(texto));
  total:=(100*(resultado * 2 / ( length(padrao) + length(texto) )));
  Similar:=total;
end;
http://rumkin.com/reference/algorithms/fuzzy_strings/