Computing Jaccard index with random permutation – Passion is like genius; a miracle.

Jaccard index에 대한 언급은 여기에서 앞서서 했음. Jaccard Index는 두개 셋간의 유사성을 계산하므로.. 예를들면 두 집합(bag of words라던가)간의 유사성 평가에 사용할 수 있음.

아래는 http://en.wikipedia.org/wiki/Locality-sensitive_hashing 에서 퍼온 내용. (ACM Communication에도 실려있음.)

Suppose
U
is composed of subsets of some ground set of enumerable items
S
and the similarity function of interest is the [[Jaccard index| Jaccard index]]
J
. If
\pi
is a permutation on the indices of
S
, for
A \subseteq S
let
h(A) = \min_{a \in A} \{ \pi(a) \}
. Each possible choice of
\pi
defines a single hash function
h
mapping input sets to integers.

Define the function family
H
to be the set of all such functions and let
D
be the uniform distribution. Given two sets
A,B \subseteq S
the event that
h(A) = h(B)
corresponds exactly to the event that the minimizer of
\pi
lies inside
A \bigcap B
. As
h
was chosen uniformly at random,
Pr[h(A) = h(B)] = J(A,B)\,
and
(H,D)\,
define an LSH scheme for the Jaccard index.

간단하게 예로 설명하자면.. 두개 집합 A={a, b, c, f, g} 와 B={b, c, h, j}가 있다고 하자. 이 두 집합에 대한 Jaccard Index를 계산하려면 공통된 pair를 일일히 봐야한다.

하지만 만약, random permutation

\pi

가 다음과 같이 정의되었다고 하자.

\pi=\{a=1, b=2, c=3, \cdots\}

. 그러면 집합 A에 대하여 이 permutation

\pi

를 대응 시킨 인덱스는

\{1, 2, 3, 6, 7\}

이고 B는

\pi

에 대해

\{2, 3, 8, 10\}

가 된다. 따라서

h(A)=1, h(B)=2

가 된다.

주어진

\pi

하나에 대한 계산 방법은 위와 같고, 실제로는

\pi

를 uniform distribution에서

개 뽑은다음

h(A) = h(B)

를 만족하는

\pi

가

개 였다면,

k/m

는 Jaccard Index와 높은 확률로 같다고 판정한다.