AcademicDirect Library Agriculture Colagen StringAnalysis

Size ratio (length of the row sequence divided by the length of the column sequence):

N	BT_A1	BT_A2	CL_A1	CL_A2	DR_A1	DR_A2	HS_A1	HS_A2	RN_A1	RN_A2
BT_A1	1.000000	1.072581	1.002055	1.071010	1.011057	1.082101	1.368569	1.071010	1.388046	1.425926
BT_A2	0.932331	1.000000	0.934247	0.998536	0.942640	1.008876	1.275959	0.998536	1.294118	1.329435
CL_A1	0.997949	1.070381	1.000000	1.068814	1.008984	1.079882	1.365762	1.068814	1.385199	1.423002
CL_A2	0.933698	1.001466	0.935616	1.000000	0.944022	1.010355	1.277830	1.000000	1.296015	1.331384
DR_A1	0.989064	1.060850	0.991096	1.059297	1.000000	1.070266	1.353601	1.059297	1.372865	1.410331
DR_A2	0.924129	0.991202	0.926027	0.989751	0.934347	1.000000	1.264733	0.989751	1.282732	1.317739
HS_A1	0.730690	0.783724	0.732192	0.782577	0.738770	0.790680	1.000000	0.782577	1.014231	1.041910
HS_A2	0.933698	1.001466	0.935616	1.000000	0.944022	1.010355	1.277830	1.000000	1.296015	1.331384
RN_A1	0.720437	0.772727	0.721918	0.771596	0.728404	0.779586	0.985968	0.771596	1.000000	1.027290
RN_A2	0.701299	0.752199	0.702740	0.751098	0.709053	0.758876	0.959775	0.751098	0.973435	1.000000
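
Each entry above is the reciprocal of its mirror entry, which is what a simple length quotient produces. A minimal Python sketch of that calculation follows; the lengths 1463 and 1364 are back-calculated from the table and the "Max chars: 1463" line, not stated on the page, and the function name `size_ratio` is hypothetical.

```python
def size_ratio(row_len: int, col_len: int) -> float:
    """Length of the row sequence divided by the length of the column sequence."""
    return row_len / col_len


# Assumed lengths (inferred from the table, not given in the source):
# BT_A1 = 1463 residues, BT_A2 = 1364 residues.
bt_a1, bt_a2 = 1463, 1364

print(f"{size_ratio(bt_a1, bt_a2):.6f}")  # row BT_A1, column BT_A2
print(f"{size_ratio(bt_a2, bt_a1):.6f}")  # row BT_A2, column BT_A1
```

With these assumed lengths the two printed values agree with the BT_A1/BT_A2 and BT_A2/BT_A1 cells of the table.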

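Similar text [as %]:
Similarity between two strings, computed as described in Oliver [1993] and expressed as a percentage of their combined length. The measure depends on the order of the two strings, which is why the table is not symmetric.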

N	BT_A1	BT_A2	CL_A1	CL_A2	DR_A1	DR_A2	HS_A1	HS_A2	RN_A1	RN_A2
BT_A1	100.00	47.40	97.37	50.41	75.46	48.53	82.39	52.60	72.79	7.96
BT_A2	47.19	100.00	47.38	94.51	52.29	62.67	37.81	93.41	10.34	70.63
CL_A1	97.43	47.38	100.00	49.40	75.95	48.36	82.56	52.02	72.32	8.13
CL_A2	48.71	94.51	50.18	100.00	48.28	61.74	35.81	94.36	10.41	67.89
DR_A1	75.74	49.38	76.37	47.00	100.00	38.94	63.99	48.28	52.06	26.45
DR_A2	42.77	60.16	43.46	58.13	36.73	100.00	40.56	60.41	15.88	46.93
HS_A1	82.39	35.35	82.56	34.58	62.56	38.08	100.00	38.19	74.05	9.07
HS_A2	53.23	93.33	52.09	94.51	49.41	66.15	40.49	100.00	9.92	66.81
RN_A1	72.79	20.93	72.32	20.58	44.22	22.11	74.05	9.34	100.00	51.35
RN_A2	18.64	70.63	14.00	66.56	25.80	40.62	15.18	66.97	38.94	100.00
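
The similarity measure can be sketched in Python as follows. This is an illustrative reimplementation, not the code that produced the table: it finds the first longest common substring, recurses on the parts to its left and right, and converts the count to a percentage with the same formula as the `Similar` function listed under References. The names `similar_chars` and `similar_percent` are hypothetical.

```python
def similar_chars(a: str, b: str) -> int:
    """Count matching characters: take the first longest common substring,
    then recurse on the text to the left and to the right of it."""
    best_len = best_i = best_j = 0
    for i in range(len(a)):
        for j in range(len(b)):
            k = 0
            while i + k < len(a) and j + k < len(b) and a[i + k] == b[j + k]:
                k += 1
            if k > best_len:  # strict '>' keeps the first match of maximal length
                best_len, best_i, best_j = k, i, j
    if best_len == 0:
        return 0
    left = similar_chars(a[:best_i], b[:best_j])
    right = similar_chars(a[best_i + best_len:], b[best_j + best_len:])
    return best_len + left + right


def similar_percent(a: str, b: str) -> float:
    """Similarity as a percentage: 2 * matches / (len(a) + len(b)) * 100."""
    if not a and not b:
        return 0.0
    return 100.0 * 2 * similar_chars(a, b) / (len(a) + len(b))


print(similar_chars("World", "Word"))
print(similar_chars("PHP IS GREAT", "WITH MYSQL"),
      similar_chars("WITH MYSQL", "PHP IS GREAT"))
```

The last line illustrates the order dependence: when several common substrings share the maximal length, the one found first depends on which string is scanned in the outer loop, so swapping the arguments can change the count. The sketch is cubic in the worst case and is meant for short strings only.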

The Levenshtein distance is the minimum number of characters that must be replaced, inserted, or deleted to transform str1 into str2.
int levenshtein ( string str1, string str2 [, int cost_ins, int cost_rep, int cost_del])
In its simplest form the function takes only the two strings and counts each operation as 1. A second variant takes three additional parameters that set the cost of insert, replace, and delete operations. It is more general and adaptive than the first variant, but less efficient.
The function is limited to strings of at most 255 characters. It is extended here by processing the sequences in consecutive 255-character windows.
Max chars: 1463
Levenshtein from 0 to 255
Levenshtein from 255 to 510
Levenshtein from 510 to 765
Levenshtein from 765 to 1020
Levenshtein from 1020 to 1275
Levenshtein from 1275 to 1530
Levenshtein [1,1,1]:

N	BT_A1	BT_A2	CL_A1	CL_A2	DR_A1	DR_A2	HS_A1	HS_A2	RN_A1	RN_A2
BT_A1	0	903	70	893	469	909	429	884	1031	1061
BT_A2	903	0	892	98	902	463	882	115	921	941
CL_A1	70	892	0	891	440	903	448	883	1033	1059
CL_A2	893	98	891	0	896	492	883	76	921	944
DR_A1	469	902	440	896	0	909	731	895	1016	1044
DR_A2	909	463	903	492	909	0	871	487	890	923
HS_A1	429	882	448	883	731	871	0	874	650	671
HS_A2	884	115	883	76	895	487	874	0	923	940
RN_A1	1031	921	1033	921	1016	890	650	923	0	428
RN_A2	1061	941	1059	944	1044	923	671	940	428	0

Levenshtein [1,10,1]:

N	BT_A1	BT_A2	CL_A1	CL_A2	DR_A1	DR_A2	HS_A1	HS_A2	RN_A1	RN_A2
BT_A1	0	1337	103	1325	726	1345	454	1327	1411	1443
BT_A2	1337	0	1320	170	1337	762	1215	200	1304	1190
CL_A1	103	1320	0	1312	703	1340	471	1316	1404	1440
CL_A2	1325	170	1312	0	1345	798	1213	150	1302	1248
DR_A1	726	1337	703	1345	0	1347	924	1337	1407	1451
DR_A2	1345	762	1340	798	1347	0	1207	790	1268	1298
HS_A1	454	1215	471	1213	924	1207	0	1205	1039	1053
HS_A2	1327	200	1316	150	1337	790	1205	0	1302	1250
RN_A1	1411	1304	1404	1302	1407	1268	1039	1302	0	710
RN_A2	1443	1190	1440	1248	1451	1298	1053	1250	710	0
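
The effect of the cost parameters can be sketched in Python. This is a minimal illustration, not the page's implementation: it keeps only two rows of the table, so it has no 255-character limit, and it does not reproduce the windowed processing logged above. It assumes the bracketed vectors follow the signature order (insert, replace, delete); the function name `weighted_levenshtein` is hypothetical.

```python
def weighted_levenshtein(s: str, t: str,
                         cost_ins: int = 1,
                         cost_rep: int = 1,
                         cost_del: int = 1) -> int:
    """Minimum total cost to turn s into t with per-operation costs."""
    # prev[j] = cost of building t[:j] from the empty prefix of s (j insertions)
    prev = [j * cost_ins for j in range(len(t) + 1)]
    for i, sc in enumerate(s, start=1):
        curr = [i * cost_del]  # deleting the first i characters of s
        for j, tc in enumerate(t, start=1):
            rep = 0 if sc == tc else cost_rep
            curr.append(min(prev[j] + cost_del,       # delete sc
                            curr[j - 1] + cost_ins,   # insert tc
                            prev[j - 1] + rep))       # keep or replace
        prev = curr
    return prev[-1]


print(weighted_levenshtein("kitten", "sitting"))            # costs [1,1,1]
print(weighted_levenshtein("kitten", "sitting", 1, 10, 1))  # costs [1,10,1]
```

When a replacement costs more than one insertion plus one deletion, as with [1,10,1], a replacement is never chosen, and the result equals the sum of the two lengths minus twice the length of the longest common subsequence.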

Data source:

Name	Reference
BT_A1	>ipi|NP_001029211|NP_001029211 COLLAGEN ALPHA-1(I) Bos Taurus
BT_A2	>ipi|NP_776945|NP_776945.1 COLLAGEN ALPHA-2(I) Bos Taurus
CL_A1	>ipi|NP_001003090|NP_001003090.1 COLLAGEN ALPHA-1(I) Canis lupus familiaris
CL_A2	>ipi|NP_001003187|NP_001003187.1 COLLAGEN ALPHA-2(I) Canis lupus familiaris
DR_A1	>ipi|NP_954684|NP_954684.1 COLLAGEN ALPHA-1(I) Danio rerio
DR_A2	>ipi|CAK05064|CAK05064.1 COLLAGEN ALPHA-2(I) Danio rerio
HS_A1	>ipi|CAA67261|CAA67261.1 COLLAGEN ALPHA-1(I) Homo sapiens
HS_A2	>ipi|AAH42586|AAH42586.1 COLLAGEN ALPHA-2(I) Homo sapiens
RN_A1	>ipi|IPI00188909|IPI00188909.2 COLLAGEN ALPHA-1(I) CHAIN PRECURSOR.
RN_A2	>ipi|IPI00188921|IPI00188921.1 COLLAGEN ALPHA-2(I) CHAIN PRECURSOR.

References:

Levenshtein distance

In information theory and computer science, the Levenshtein distance is a string metric which is one way to measure edit distance. The Levenshtein distance between two strings is given by the minimum number of operations needed to transform one string into the other, where an operation is an insertion, deletion, or substitution of a single character. It is named after Vladimir Levenshtein, who considered this distance in 1965 [V. I. Levenshtein, Binary codes capable of correcting deletions, insertions, and reversals. Soviet Physics Doklady 10 (1966):707–710]. It is useful in applications that need to determine how similar two strings are.
The distance can be computed with the Wagner-Fischer dynamic-programming algorithm for edit distance. Here is pseudocode for a function LevenshteinDistance that takes two strings, s of length m and t of length n, and computes the Levenshtein distance between them:

int LevenshteinDistance(char s[1..m], char t[1..n])
   // d is a table with m+1 rows and n+1 columns
   declare int d[0..m, 0..n]
 
   for i from 0 to m
       d[i, 0] := i
   for j from 1 to n
       d[0, j] := j
 
   for i from 1 to m
       for j from 1 to n
           if s[i] = t[j] then cost := 0
                          else cost := 1
           d[i, j] := minimum(
                                d[i-1, j] + 1,     // deletion
                                d[i, j-1] + 1,     // insertion
                                d[i-1, j-1] + cost   // substitution
                            )
 
   return d[m, n]
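
The pseudocode above translates almost line for line into Python. The sketch below keeps the full (m+1) by (n+1) table for clarity; the only change is the shift from 1-based to 0-based string indexing. The function name `levenshtein_distance` is chosen for illustration.

```python
def levenshtein_distance(s: str, t: str) -> int:
    """Wagner-Fischer edit distance with unit costs, using the full table."""
    m, n = len(s), len(t)
    # d[i][j] = distance between the first i chars of s and the first j chars of t
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i  # delete all i characters
    for j in range(1, n + 1):
        d[0][j] = j  # insert all j characters
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[m][n]


print(levenshtein_distance("kitten", "sitting"))
```

Time and memory are both proportional to m times n, so for sequences of around 1,400 characters the table holds roughly two million integers.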

http://www.csse.monash.edu.au/~dld/Publications/1993/Dowe+Oliver+Dix+Allison+Wallace1993_Decision_Graph_Explanation.html
Oliver, J.J., Dowe, D.L., Wallace, C.S. (1992) Inferring decision graphs using the minimum message length principle. Proceedings of the 5th Australian Joint Conference on Artificial Intelligence, pp. 361-367. Eds. A. Adams and L. Sterling, Singapore: World Scientific.
Oliver, J.J., Hand, D.J. (1994) Introduction to minimum encoding inference. Technical Report TR 94-4, Department of Statistics, Open University, Walton Hall, Milton Keynes, MK7 6AA, UK. Also available as TR 205, Department of Computer Science, Monash University, Clayton, Victoria 3168, Australia, http://www.cs.monash.edu.au/~jono
Oliver, J.J., Baxter, R.A. (1994) MML and Bayesianism: Similarities and Differences. Technical Report TR 206, Dept. of Computer Science, Monash University, Clayton, Victoria 3168, Australia, http://www.cs.monash.edu.au/~jono

function TfMain.Fuzzy_Simil(padrao, texto: string; tpadrao,ttexto:integer):variant;
var max, l, pos1, pos2, sum, end1, end2, cont1, cont2 : integer;
    p, q : string;
begin
   max := 1;    end1 :=  ttexto + 1;   end2 :=  tpadrao + 1;   pos1:=1;
   pos2:=1;   p := texto;   cont1:=1;
   while cont1 < end1 do begin
         q:=padrao;
         cont2:=1;
             while cont2 < end2 do begin
               l:=1;
                    while (((cont1 + l) - 1) < end1) and (((cont2 + l)-1) < end2)
                     and (p[(cont1+l)-1] = q[(cont2+l)-1]) do inc(l);
               if (l > max) then
                    begin
                      max := l;
                      pos1 := (length(texto) - (length(texto)-(cont1-1)))+1;
                      pos2 := (length(padrao) - (length(padrao)-(cont2-1)))+1;
                    end;
               inc(cont2);
             end;
         inc(cont1);
   end;
   if (max = 1) then Fuzzy_Simil:=0
      else begin
             sum := max;
             if (pos1>1) and (pos2>1)
                 then sum := sum + Fuzzy_Simil(padrao, texto, pos2-1, pos1-1);
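             // Right-hand remainder: pos1 indexes texto and pos2 indexes padrao;
             // each remainder holds (length - pos - max + 2) characters.
             if (((pos1 + max - 1) <= ttexto) and ((pos2 + max - 1) <= tpadrao)) then
                begin
                  sum := sum + Fuzzy_Simil(copy(padrao,pos2 + max - 1,length(padrao)), copy(texto,pos1 + max - 1,length(texto)), tpadrao - pos2 - max + 2, ttexto - pos1 - max + 2);
                end;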
             Fuzzy_Simil:=sum-1;
           end;
end;


function TfMain.Similar(padrao, texto:string):variant;
var resultado:integer;
    total:variant;
begin
  resultado:=Fuzzy_Simil(padrao, texto, length(padrao), length(texto));
  total:=(100*(resultado * 2 / ( length(padrao) + length(texto) )));
  Similar:=total;
end;
http://rumkin.com/reference/algorithms/fuzzy_strings/