Monday, November 01, 2004

SharePoint multilanguage features

When you live in a country with three official languages like I do (yes, we have Dutch, French and German in Belgium), multilanguage support definitely becomes an issue when doing a portal implementation. Unfortunately, multilanguage support is not a very strong feature in SharePoint. Windows SharePoint Services allows you to choose a different language at site creation time once you have installed the language packs, but after the site is created it is no longer possible to switch the language. For SharePoint Portal Server it is even worse: it only supports the language of the SharePoint installation, so if you want a French portal you have to install the French version of SharePoint Portal Server. There are some articles which can guide you, but better multilanguage functionality is definitely something I want to see in the next version of SharePoint.

One of the things I found very hard to track down is how thesaurus files work in SharePoint. The best documentation is found in the SPS 2001 Resource Kit and still seems to be correct. Another interesting document is the one on TechNet which describes the international features of SharePoint 2003.

I tried out the expansion set feature in SharePoint 2003 to see how it works with files in different languages and different formats. SharePoint uses thesaurus files to enhance its search functionality. The thesaurus allows you to search not only for the search term itself, but also for synonyms and other matching words, like words with the same stem. You can extend the thesaurus by adding tags to the thesaurus file(s). For example, when a user searches for ‘apple*’, you want to automatically search for ‘pomme*’ and ‘appel*’ as well, so that documents containing the French and Dutch translations of the word ‘apple’ are added to the search results. This is an example of stemming. An example of an expansion set is one for the word "author" which makes the search also return documents containing "writer".
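To make the idea of an expansion set concrete, here is a minimal sketch in Python. It is not how SharePoint implements the thesaurus; the `EXPANSION_SETS` list and the `expand_query` function are names I made up for illustration. It only shows the principle: a search term is replaced by every term in the sets it belongs to.

```python
# Hypothetical expansion sets: every term in a set is treated as
# equivalent to the others when searching.
EXPANSION_SETS = [
    {"apple", "pomme", "appel"},
    {"author", "writer"},
]


def expand_query(term):
    """Return the term plus all members of any expansion set containing it."""
    expanded = {term}
    for group in EXPANSION_SETS:
        if term in group:
            expanded |= group
    return sorted(expanded)


print(expand_query("author"))
```

A term that is in no set simply comes back on its own, so the search behaves as before for words the thesaurus does not know.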

There is a separate thesaurus file for every language. These are xml files, e.g. tsneu.xml is the neutral thesaurus file, tseng.xml is for English and tsnld.xml is for Dutch. The easiest way to test it is to modify the xml files at local_drive\Program Files\SharePoint Portal Server\Data\Applications\Application UID\Config and then restart the Microsoft SharePointPS Search service.
I added the following lines to the xml files:
<sub weight="0.5">author</sub>
<sub weight="0.5">writer</sub>
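If you need to generate such entries for many words, a small script can help. The sketch below builds the two lines above with Python's standard `xml.etree.ElementTree` module. Note that the surrounding `expansion` element is my assumption about how the entries are grouped in the thesaurus file; it is not shown in the lines above, so check it against your own thesaurus file before using it. The function name `build_expansion` is also just an illustration.

```python
import xml.etree.ElementTree as ET


def build_expansion(terms, weight="0.5"):
    """Build one expansion set with a <sub> entry per term.

    The <expansion> wrapper element is an assumption about the
    thesaurus file layout; verify it against an existing file.
    """
    expansion = ET.Element("expansion")
    for term in terms:
        sub = ET.SubElement(expansion, "sub", {"weight": weight})
        sub.text = term
    return expansion


fragment = build_expansion(["author", "writer"])
# Serialize to a string so it can be pasted into the thesaurus file.
xml_text = ET.tostring(fragment, encoding="unicode")
print(xml_text)
```

The script only produces the fragment; pasting it into the right thesaurus file and restarting the search service remain manual steps.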
It turns out that a different thesaurus file is used depending on the file type, which is a consequence of the way iFilters are implemented.
I tried the search for different file types:

  • A plain text file: uses tsneu.xml and none of the language-specific ones.

  • A Word document: works with tsenu.xml (English) and also with tsneu.xml (neutral). So for Office documents the language is actually recognized.

  • PDF files don’t seem to use the thesaurus files at all. I tried it with iFilter 5.0 and it didn’t use any of the thesaurus files. It seems that iFilter 6.0 was released last week, so I'll try that one later.
