Document Similarity and Containment

On the Resemblance and Containment of Documents

A very popular article on document similarity and containment (cited 528 times according to Google).

For similarity, the article discusses minhash, which I've already posted about here.

For containment (how much of document A is contained in B), the authors suggest treating each shingle as a number and keeping only the shingles that are 0 mod m, i.e., shingles whose remainder is zero when divided by m. Call the set extracted from document A in this way V(A). Then the containment of A in B is estimated as

|V(A) \cap V(B)|/|V(A)|

In other words, sample the shingles in a consistent way, and then compute containment directly on the samples.
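To make the idea concrete, here is a minimal sketch of how I would code it, not the authors' implementation. The word-level shingles, the truncated SHA-1 fingerprint, and the function names (shingles, fingerprint, sample_mod_m, containment) are all my own assumptions for illustration.

```python
import hashlib


def shingles(text, k=4):
    """Return the set of k-word shingles (contiguous word windows) in text."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def fingerprint(shingle):
    """Map a shingle to a 64-bit integer (assumed choice: truncated SHA-1)."""
    digest = hashlib.sha1(shingle.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def sample_mod_m(text, k=4, m=4):
    """V(text): fingerprints of the shingles of text that are 0 mod m."""
    return {fp for fp in map(fingerprint, shingles(text, k)) if fp % m == 0}


def containment(a, b, k=4, m=4):
    """Estimate the containment of a in b as |V(a) & V(b)| / |V(a)|."""
    va, vb = sample_mod_m(a, k, m), sample_mod_m(b, k, m)
    if not va:
        return None  # no shingle of a survived the sampling; estimate undefined
    return len(va & vb) / len(va)
```

With m = 1 every shingle is kept, so the result is the exact containment over shingles; larger m keeps roughly one shingle in m, which makes the sets smaller and the estimate noisier. For very short documents V(A) can be empty, which is why the sketch returns None in that case.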

Unlike the similarity computation, the containment computation requires me to extract an arbitrary number of shingles from each document, though I can limit that number a bit. I wish I could fix in advance, as in the case of the similarity computation, the number of shingles required to compute the containment score.
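A rough way to see the problem: if fingerprints are spread evenly, about one shingle in m survives the 0 mod m filter, so the sample grows with the document. The snippet below is only an illustration under my own assumptions: the shingles are synthetic strings, the fingerprint is a truncated SHA-1, and sample_size is a made-up helper name.

```python
import hashlib


def sample_size(num_shingles, m):
    """Count how many synthetic shingles have a fingerprint that is 0 mod m."""
    kept = 0
    for i in range(num_shingles):
        # Synthetic stand-in for a real shingle; fingerprint is a truncated SHA-1.
        digest = hashlib.sha1(f"shingle-{i}".encode("utf-8")).digest()
        if int.from_bytes(digest[:8], "big") % m == 0:
            kept += 1
    return kept


# The sample size scales with the number of shingles (roughly n / m).
for n in (1_000, 10_000, 100_000):
    print(n, sample_size(n, m=10))
```

Raising m shrinks the samples for large documents, but it also leaves small documents with very few (or zero) sampled shingles, so m alone does not give a fixed sample size.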