MATLAB聚类有效性评价指标（外部）

作者：凯鲁嘎吉 - 博客园 http://www.cnblogs.com/kailugaji/

更多内容，请看标签：MATLAB、聚类

前提：数据的真实标签已知！

1. 归一化互信息(Normalized Mutual information)

定义

程序

function MIhat = nmi(A, B)
%NMI Normalized mutual information
% A, B: 1*N;
if length(A) ~= length(B)
    error('length( A ) must == length( B)');
end
N = length(A);
A_id = unique(A);
K_A = length(A_id);
B_id = unique(B);
K_B = length(B_id);
% Mutual information
A_occur = double (repmat( A, K_A, 1) == repmat( A_id', 1, N ));
B_occur = double (repmat( B, K_B, 1) == repmat( B_id', 1, N ));
AB_occur = A_occur * B_occur';
P_A= sum(A_occur') / N;
P_B = sum(B_occur') / N;
P_AB = AB_occur / N;
MImatrix = P_AB .* log(P_AB ./(P_A' * P_B)+eps);
MI = sum(MImatrix(:));
% Entropies
H_A = -sum(P_A .* log(P_A + eps),2);
H_B= -sum(P_B .* log(P_B + eps),2);
%Normalized Mutual information
MIhat = MI / sqrt(H_A*H_B);

结果

>> A = [1 1 1 1 1 1 2 2 2 2 2 2 3 3 3 3 3];
>> B = [1 2 1 1 1 1 1 2 2 2 2 3 1 1 3 3 3];
>> MIhat = nmi(A, B)
 
MIhat =
 
    0.3646

2. Rand统计量(Rand index)

定义

程序

function [AR,RI,MI,HI]=RandIndex(c1,c2)
%RANDINDEX - calculates Rand Indices to compare two partitions
% ARI=RANDINDEX(c1,c2), where c1,c2 are vectors listing the 
% class membership, returns the "Hubert & Arabie adjusted Rand index".
% [AR,RI,MI,HI]=RANDINDEX(c1,c2) returns the adjusted Rand index, 
% the unadjusted Rand index, "Mirkin's" index and "Hubert's" index.
 
if nargin < 2 || min(size(c1)) > 1 || min(size(c2)) > 1
   error('RandIndex: Requires two vector arguments')
   return
end
 
C=Contingency(c1,c2);	%form contingency matrix
 
n=sum(sum(C));
nis=sum(sum(C,2).^2);		%sum of squares of sums of rows
njs=sum(sum(C,1).^2);		%sum of squares of sums of columns
 
t1=nchoosek(n,2);		%total number of pairs of entities
t2=sum(sum(C.^2));	%sum over rows & columnns of nij^2
t3=.5*(nis+njs);
 
%Expected index (for adjustment)
nc=(n*(n^2+1)-(n+1)*nis-(n+1)*njs+2*(nis*njs)/n)/(2*(n-1));
 
A=t1+t2-t3;		%no. agreements
D=  -t2+t3;		%no. disagreements
 
if t1==nc
   AR=0;			%avoid division by zero; if k=1, define Rand = 0
else
   AR=(A-nc)/(t1-nc);		%adjusted Rand - Hubert & Arabie 1985
end
 
RI=A/t1;			%Rand 1971		%Probability of agreement
MI=D/t1;			%Mirkin 1970	%p(disagreement)
HI=(A-D)/t1;	%Hubert 1977	%p(agree)-p(disagree)
 
function Cont=Contingency(Mem1,Mem2)
 
if nargin < 2 || min(size(Mem1)) > 1 || min(size(Mem2)) > 1
   error('Contingency: Requires two vector arguments')
   return
end
 
Cont=zeros(max(Mem1),max(Mem2));
 
for i = 1:length(Mem1)
   Cont(Mem1(i),Mem2(i))=Cont(Mem1(i),Mem2(i))+1;
end

程序中包含了四种聚类度量方法：Adjusted Rand index、Rand index、Mirkin index、Hubert index。

结果

>> A = [1 1 1 1 1 1 2 2 2 2 2 2 3 3 3 3 3];
>> B = [1 2 1 1 1 1 1 2 2 2 2 3 1 1 3 3 3];
>> [AR,RI,MI,HI]=RandIndex(A,B)
 
AR =
 
    0.2429
 
 
RI =
 
    0.6765
 
 
MI =
 
    0.3235
 
 
HI =
 
    0.3529