- SharePoint Deep Dive exploration: SharePoint alerting
- SharePoint Deep Dive exploration: looking into the SharePoint userinfo table
SharePoint Server can detect near duplicates of documents and will take this into account when displaying search results. In this post I will delve a little deeper into the underlying techniques being used. An important thing to keep in mind is that the way that duplicate documents are identified has evolved and changed in the different versions of SharePoint.
SharePoint Server 2007 detected duplicates using a commonly used technique called "shingling". This is a generic technique which allows you to identify duplicates or near duplicates of documents (or webpages). Shingling has been widely used in different types of systems and software to identify spams, plagiarism or to enforce copyright protection. A shingle – which is more more commonly referred to as a q-gram – is a contiguous subsequence of tokens taken from a document.
So if you want to see if two documents are similar, you can do this by looking at how many shingles they have in common. You however need to determine how long your subsequence of tokens needs to be – typically a value of 4 is used. This is formalized by using S(d,w), which is the set of distinct shingles of width w which are contained in a document e.g. for the line “a rose is a rose is a rose” – so with w=4, we get the following shingles “a rose is a”, “rose is a rose”, “is a rose is”. If you wan to compare the similarity between two sets, e.g. S(doc1) and S(doc2) which are the sets of distinct shingles of document1 and document2, you can use the Jaccard similarity index (or resemblance index) to define the degree of similarity. A Jaccard index with a value of 0 means that documents are completely dissimilar, whereas 1 points to identical documents. This would however that we would need to calculate the similarity index of each pair of documents – which would be a quite intensive task – to speed up processing a form of hashing is used (for more details take a look at the explanation about near duplicates and shingling)
As items in SharePoint 2007 were indexed, these hashes were stored in the search database. It is not really clear from the documentation whether these hashes only related to the content of an item or to the properties as well (although this blog - Microsoft Office SharePoint Server 2007: Duplicate search results states that it is only on the content of a document). So in SharePoint Server 2007 these hashes were stored in the MSSDuplicateHashes tables.
In SharePoint Server 2013 these hashes are not stored in the MSSDuplicateHashes table anymore but in the DocumentSignature – this is documented in the article Customizing search results in SharePoint 2013. In the next screenshot I have used the and you will notice that although the document title and some metadata are different for the 5 documents, there are only 2 distinct document signatures. This indicates that the shingle is only calculated using the content of documents and not the metadata or the file name (Content By Search web parts don’t seem to use duplicate trimming). The document signature actually contains 4 checksums and if one of the four matches with another document, the document is treated as a duplicate. This also means that when SharePoint search encounters a document for which it is unable to extract the actual contents, it probably is not able to do proper duplicate trimming.
Since SharePoint Server 2013 search result web parts have duplicate trimming activated and SharePoint 2013 is using a quite coarse algorithm for determining a duplicate, you will see some unexpected results. Luckily after installing the SharePoint 2013 Cumulative Update July 2014 you will have the option to de-activate duplicate trimming within the query builder settings.
Another way to accomplish the same thing is by changing the settings for grouping of results. As outlined in Customizing search results in SharePoint 2013, duplicate removal of search results is a part of grouping. So if you specify to group on DocumentSignature, you would be able to show near duplicates (if one of the 4 checksums is different) but still omit the “complete” duplicates.