Saturday, December 10, 2011

How to prevent an item from being indexed with FAST for SharePoint

Yes, it can be done!

[Update 2012-05-25: There is an even better hack to solve this issue, as Kai wrote in a comment below this post. Check out Segei Sergeev's answer in this TechNet thread.]

It’s Saturday, my kid has gone to sleep, and I finally have time to tell you guys the good news. Preventing an item from being indexed, or put differently, dropping a document in the document processing pipeline, is indeed possible!

You can already prevent items from being indexed by excluding SharePoint lists and libraries from crawling via library settings, and you can use crawl rules to exclude certain URL patterns. But what I am talking about is preventing an item from being indexed based on business rules in your organization, by looking at the meta data of the item or inside the text of a file.

There are many scenarios for not wanting an item searchable. You might want to prevent indexing items in your organization which contain the super secret codename “wobba”, or items of a certain ContentType. When indexing file shares you don't have much meta data to go on at all for excluding items, so creating your own module with the proper rules might be the only way to go.

Up until this post, this was not easily doable with FAST for SharePoint. With built-in SharePoint Search it’s still impossible (unless you create a custom crawler).
There have been at least a couple of threads about this on the FS4SP TechNet forum, and we concluded long ago that this is not an easy task and cannot be done in a supported manner. (Supported meaning not editing config files which the documentation on TechNet tells us not to touch.)

I had sort of thought about how to do this earlier, but I didn’t figure it out before reading Leonardo Souza’s blog post the other day about: How "Remove Duplicate Results" works in FAST Search for SharePoint.

Leo talks about a property called “documentsignaturecontribution” which can be used to append data to the document signature checksum in the FS4SP content processing pipeline. But in order to assign data to this property you have to create a managed property by that exact name, and output your custom data to a crawled property of your choosing which is mapped to the managed property.

You have to work with a managed property because the document signature stage runs after the stage which maps crawled properties to managed properties, and all stages below the mapper stage work on managed properties. This find by Leo is just so cool, and there is no documentation on this anywhere as far as I’ve seen.

So, over to our problem. Which stage runs just before the document signature stage and comes to our aid?

The “Offensive Content Filter” stage

This stage also has an additional attribute where you can assign data, named “ocfcontribution”. There’s only vague documentation on MSDN on how to assign data to this field, which refers to using the XMLMapper. Using the XMLMapper means indexing xml documents, and this is a bit limiting.

The thing about the offensive content filter is that it will prevent documents from being indexed if they contain a certain amount of bad language. If you get embarrassed by such words, then skip reading :)

So now we have a stage which can drop items. What remains is to assign enough bad words to “ocfcontribution” to get above the threshold it triggers on.

First off enable the Offensive Content Filter by editing C:\FASTSearch\etc\config_data\DocumentProcessor\optionalprocessing.xml

Next, create a managed property called “ocfcontribution” of type “Text”, and also a crawled property with this name. The GUID for the property set is one I have chosen for a test group in my system. Replace it with one that suits your own system.
$mp = New-FASTSearchMetadataManagedProperty -Name ocfcontribution -Type 1
$cp = New-FASTSearchMetadataCrawledProperty -Name ocfcontribution -Propset fa585f53-2679-48d9-976d-9ce62e7e19b7 -VariantType 31
New-FASTSearchMetadataCrawledPropertyMapping -ManagedProperty $mp -CrawledProperty $cp

In order to test this I have created an xml file named “drop.xml” which I placed in C:\FASTSearch\pipelinemodules with the following contents

<?xml version="1.0" encoding="utf-8"?>
<Document>
   <CrawledProperty propertySet="fa585f53-2679-48d9-976d-9ce62e7e19b7" propertyName="ocfcontribution" varType="31">fuck shit porn cunt cock dick</CrawledProperty>
</Document>

Next I added the following custom extensibility stage to C:\FASTSearch\etc\pipelineextensibility.xml

<Run command="copy C:\FASTSearch\pipelinemodules\drop.xml %(output)s">
    <Output>
        <CrawledProperty propertySet="fa585f53-2679-48d9-976d-9ce62e7e19b7" varType="31" propertyName="ocfcontribution"/>
    </Output>
</Run>

For each item, this stage assigns the contents of drop.xml to “ocfcontribution”, effectively dropping all items, which is OK for test purposes. In practice you would create a custom module instead, which holds your business rules for when an item should be dropped.
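As a rough illustration of such a custom module, here is a minimal Python sketch. It is not something I have run in FS4SP, and several things in it are assumptions: that a Python 3 interpreter is installed on the FAST server, that the stage is registered with a command along the lines of "python C:\FASTSearch\pipelinemodules\dropfilter.py %(input)s %(output)s" (dropfilter.py is a made-up name), that the properties you want to inspect are declared in an <Input> element for the stage, and that the input file uses the same Document/CrawledProperty layout as drop.xml. The “wobba” rule is just the example from earlier in this post. The trigger words are read from drop.xml so they live in one place only.

```python
import sys
import xml.etree.ElementTree as ET

# Property set GUID from this post; replace it with your own.
PROPSET = "fa585f53-2679-48d9-976d-9ce62e7e19b7"
# The trigger words are reused from the test file created above.
DROP_FILE = r"C:\FASTSearch\pipelinemodules\drop.xml"
# Hypothetical business rule: drop anything mentioning these terms.
SECRET_WORDS = ("wobba",)


def should_drop(input_xml):
    """Return True if any crawled property value contains a secret word."""
    root = ET.fromstring(input_xml)
    for prop in root.iter("CrawledProperty"):
        text = (prop.text or "").lower()
        if any(word in text for word in SECRET_WORDS):
            return True
    return False


def build_output(drop, trigger_text):
    """Build the output document; ocfcontribution is set only when dropping."""
    doc = ET.Element("Document")
    if drop:
        prop = ET.SubElement(doc, "CrawledProperty", {
            "propertySet": PROPSET,
            "propertyName": "ocfcontribution",
            "varType": "31",
        })
        prop.text = trigger_text
    return ET.tostring(doc, encoding="unicode")


def read_trigger_text(path=DROP_FILE):
    """Read the trigger words from the CrawledProperty element in drop.xml."""
    return ET.parse(path).getroot().findtext("CrawledProperty", default="")


def main(argv):
    input_path, output_path = argv[1], argv[2]
    with open(input_path, "rb") as f:
        drop = should_drop(f.read())
    trigger = read_trigger_text() if drop else ""
    with open(output_path, "w", encoding="utf-8") as f:
        f.write('<?xml version="1.0" encoding="utf-8"?>\n')
        f.write(build_output(drop, trigger))


# Only run as a pipeline stage when both file paths are supplied.
if __name__ == "__main__" and len(sys.argv) == 3:
    main(sys.argv)
```

Items that do not match a rule get an empty Document element back, so “ocfcontribution” stays unset and the filter only sees the item's own text.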

Next issue “psctrl reset” to reload your config files and use for example “docpush” to index an item, and it will not be indexed, as the output below shows.

PS C:\temp> docpush -c test sample.txt
[2011-12-10 20:31:02.677] ERROR      test Reported error with http://cohowinery.com/sample.txt: processing:OffensiveContentFilter:ERROR: Processor error status: NotPassing
[2011-12-10 20:31:03.678] INFO       test All add operations completed
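
If you push many test items, it can be handy to list which ones the filter rejected. The sketch below is only an assumption about how you might do that: it expects the docpush output to have been saved as text with each message on a single line, and it matches the error format shown above. The function name dropped_items is my own.

```python
import re

# Matches the error line printed when the Offensive Content Filter rejects an item.
DROPPED = re.compile(
    r"Reported error with (?P<url>\S+): processing:OffensiveContentFilter:.*NotPassing"
)


def dropped_items(log_text):
    """Return the URLs of items rejected by the Offensive Content Filter."""
    return [m.group("url") for m in DROPPED.finditer(log_text)]


sample = (
    "[2011-12-10 20:31:02.677] ERROR      test Reported error with "
    "http://cohowinery.com/sample.txt: processing:OffensiveContentFilter:"
    "ERROR: Processor error status: NotPassing\n"
    "[2011-12-10 20:31:03.678] INFO       test All add operations completed\n"
)
print(dropped_items(sample))  # ['http://cohowinery.com/sample.txt']
```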

I hope someone will find this trick useful. It seems you can use English words to trigger the filter, no matter the language of your items.

PS! If you enable the Offensive Content Filter and have content with explicit language, you could risk some items not being indexed with this method.

16 comments:

  1. Excellent work connecting the dots to enable a feature that so many have requested! I look forward to trying this out myself!

    PS! And cool picture to warn about the content ;-)

    Thomas Svensen

  2. Once again a great post!

    Will the links from a filtered item still be extracted for crawling purposes? Similar to the SharePoint option to crawl without indexing?

    Thanks!

  3. mwiselka, yes, as links are extracted from the SharePoint crawler and not within the pipeline.

  4. Wow...

    Amazing !

    Thank you so much !

    Amir

  5. Amir,
    if you can use this that's great :) Let me know if the method accidentally removes content you wish to keep... which can be a side effect depending on your content.

  6. Hi Mikael,
    I am sorry for asking about a problem in a different context from the blog article above. Here is my problem.

    I wanted to know how to access the value of "contentid" property in code ( in pipeline ), so that I can copy its value in a custom crawled property and then map this crawled property to a custom queryable managed property. It is present neither in Managed Properties list on the SharePoint 2010 server nor shown by the Powershell command "Get-FASTSearchMetadataManagedProperty -name contentid" on the FAST Server.

    Even after adding the Spy stages in the pipeline, I get the value of "contentid" in a property "docid" in the "Spy.txt" file. But unfortunately the "docid" that is present in the "Spy.txt" file is not a crawled property. Following is an excerpt from the file

    "@@@@ ROUTING docid: ssic://1191"

    Please reply as soon as possible. Thanks in advance.

  7. You can edit a parameter in a config file to allow docid to be used in custom modules. But, you can access the "contentid" field from the xml returned during search as it's always returned. But it's not a property you can access like you mention.

    What's the use case? (And it's better to ask questions at http://social.technet.microsoft.com/Forums/en-us/fastsharepoint/threads :) )

    Replies
    1. Thanks for the reply.
      Which parameter do I need to edit and in which config file to allow docid to be used in custom modules?

      Actually I require some selected results to be shown on the results page using a fixed query. Currently I am using the 'path' property of the results in the querystring, but as the length of the querystring in a fixed query is limited (2048 chars), I am unable to show a large number of selected results. The value in the "contentid" field is unique and compact, so I can get a comparatively large number of selected results by using it in the querystring. Since "contentid" is not queryable, I need to copy its value into a crawled property and then map that crawled property to a custom queryable managed property so that I can use it in the querystring to retrieve selected results.

      Please suggest a solution to the above problem..really need your help as I had already posted the problem in "http://social.technet.microsoft.com/Forums/en-US/fastsharepoint/thread/0773aebf-3a67-475e-a0fa-1367bdd00585/" thread (in its continuation) but no one replied.

      Sandip

  8. Hi, Mikael,
    I've posted an answer on the MSDN forum (http://social.technet.microsoft.com/Forums/en-US/fastsharepoint/thread/835bb78e-3bdf-4ed3-8ac1-aa3ce389fdd5#16640113-ce82-4c32-9e3f-9cedeeef4018) about another possible solution. What do you think about it?
    Thank you.

    Replies
    1. Very nice trick indeed :D I will edit my post to reflect this.

  9. Hi Mikael,
    Thank you! This is a superb post! I had always thought that offensive content filtering was the stage which could somehow be tweaked to drop items I didn't want in my index, but I could not quite figure out how.
    This is great!

    Replies
    1. Hi,
      Be sure to read the thread I link to at the top of this article as it's a better approach in my opinion :)

    2. Hi Svenson,
      Can you please tell us how FAST knows that the words given in "drop.xml" are offensive? Are those words listed in some file? If yes, can you please tell us which file it is?

    3. Hi,
      In FAST ESP you had access to the offensive word dictionaries, but not in FS4SP. You just have to guess. But trust me, those files contain a lot of terms (if they use the same files as ESP which I suspect they do)

      Thanks,
      Mikael Svenson

    4. Hi Svenson,
      Thanks a lot for your reply.
