Saturday, July 30, 2011

SharePoint Search 2007 – hacking the SiteData web service – Part II

In the first part of this posting I outlined the physical architecture of the SharePoint search engine, focusing on the Sts3 protocol handler and the role that the “SiteData” web service plays in the crawling process. The very fact that the search engine uses a web service to retrieve SharePoint list and site data gave me the first clue as to how I could “hack” the crawling process. My idea was simple: since the web service is hosted in IIS, I can use some basic URL rewriting so that the call to the real web service is “covertly” redirected to a custom web service. The custom service can either re-implement the logic of the standard service with some additional functionality, or simply act as a proxy to the real web service and modify its input or output. The first option seemed far too complex, and the second was quite sufficient for the goals I had for the “hack”.

The thing is that the output XML of the SiteData.GetContent method contains all relevant SharePoint list item and schema data: the “List” option for the ObjectType parameter returns the schema of the SharePoint list, and the “Folder” option returns the list item data (see the sample XML outputs of the web service in the first part of the posting). The problem is that the Sts3 protocol handler “interprets” the data from the output XML in its own specific way, which results in the well-known limitations of the crawled properties, and of the data retrieved for them, that we get in the search index for SharePoint content. So what I decided to do was to create a small custom web service which implements the SiteData.GetContent and SiteData.GetChanges methods (with the exact same parameters and signatures).

Since I wanted to use it as a proxy to the real SiteData web service, I needed some way to pass the call on to it. The simplest option would have been to issue a second web service call from my web service, but the better solution was to create an instance of the SiteData web service class (Microsoft.SharePoint.SoapServer.SiteData from the STSSOAP assembly, which is in the _app_bin subfolder of the SharePoint web application) and call its corresponding method. The last trick of the “hack” was to take the XML output of the SiteData.GetContent and SiteData.GetChanges methods and modify it (actually add some extra stuff to it) so that I get the extra crawled properties that I needed in the search index.

So, before going into the details of the implementation of the “hack”, I want to point out several reasons why you should think twice before using it (it’s become a habit of mine to try to dissuade people from using my own solutions). I would not recommend using it in bigger production environments:

  • It tampers with the XML output of the standard SiteData web service – this may lead to unpredictable behavior of the index engine and leave it unable to crawl your site(s). The standard XML output of the SiteData service is itself not quite well-formed XML, so until I found a way to modify it without losing its original formatting I kept getting crawler errors, which showed up in the crawl log of my SSP admin site.
  • There will be a serious performance penalty compared to using just the standard SiteData service. The increased processing time comes from parsing the output XML and applying the extra modifications and additions to it.
  • The general argument that this is indeed a hack which gets inside the standard implementation of the SharePoint search indexing, which won’t sound good to managers and Microsoft guys alike.

Having said that (and if you are still reading) let me give you the details of the implementation itself. The solution of the “hack” can be downloaded from here (check the installation notes below).

The first thing I will start with is the URL rewriting logic that allows the custom web service to be invoked instead of the standard SiteData web service. IIS 7 supports URL rewriting through its URL Rewrite module, but I was testing on a Windows 2003 server with IIS 6 and was too lazy to implement a full proxy for the SiteData web service, so I took a different approach: use a custom .NET HTTP module (the better solution) or simply modify the global.asax of the target SharePoint web application (the worse but easier solution, and the one I actually used). The advantage of custom URL rewriting logic over the URL rewriting functionality available for IIS 7 is that you can additionally inspect the HTTP request data and apply the rewriting only for certain methods of the web service. So in the modified version of the global.asax I explicitly check which web service method is being called and redirect to the custom web service only if I detect the GetContent or GetChanges methods (all other methods hit the standard SiteData service directly and no URL rewriting takes place). You can see the source code of the global.asax file that I used below:

<%@ Application Language="C#" Inherits="Microsoft.SharePoint.ApplicationRuntime.SPHttpApplication" %>

<script language="C#" runat="server">

protected void Application_BeginRequest(Object sender, EventArgs e)
{
    CheckRewriteSiteData();
}

// Rewrites the request path to the custom service when GetContent or GetChanges is being called.
protected void CheckRewriteSiteData()
{
    if (IsGetListItemsCall())
    {
        string newUrl = this.Request.Url.AbsolutePath.ToLower().Replace("/sitedata.asmx", "/stefansitedata.asmx");
        HttpContext.Current.RewritePath(newUrl);
    }
}

// Returns true only for SOAP POST requests to SiteData.asmx that invoke GetContent or GetChanges,
// and only when the "UseSiteDataRewrite" appSettings key is set to "true".
protected bool IsGetListItemsCall()
{
    if (string.Compare(this.Request.ServerVariables["REQUEST_METHOD"], "post", true) != 0) return false;
    if (!this.Request.Url.AbsolutePath.EndsWith("/_vti_bin/SiteData.asmx", StringComparison.InvariantCultureIgnoreCase)) return false;

    if (string.IsNullOrEmpty(this.Request.Headers["SOAPAction"])) return false;

    // The SOAPAction header is quoted and ends with the name of the web method.
    string soapAction = this.Request.Headers["SOAPAction"].Trim('"').ToLower();
    if (!soapAction.EndsWith("getcontent") && !soapAction.EndsWith("getchanges")) return false;
    if (string.Compare(ConfigurationManager.AppSettings["UseSiteDataRewrite"], "true", true) != 0) return false;

    return true;
}

</script>

Note also that the code checks a custom “appSettings” key in the web.config file to decide whether to apply the URL rewriting. This way you can easily turn the “hack” on or off with a simple tweak in the configuration file of the SharePoint web application.
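For reference, the switch described above could look like this in the web.config of the web application. This is only a minimal fragment: the key name “UseSiteDataRewrite” and the value “true” come from the global.asax code, everything else is the standard appSettings layout. If the file already has an appSettings section, the “add” element should go into it rather than into a second section.

```xml
<configuration>
  <appSettings>
    <!-- Set the value to anything other than "true" (or remove the key) to switch the rewriting off. -->
    <add key="UseSiteDataRewrite" value="true" />
  </appSettings>
</configuration>
```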

And this is the code of the custom “SiteData” web service:

[WebServiceBinding(ConformsTo = WsiProfiles.BasicProfile1_1), WebService(Namespace = "http://schemas.microsoft.com/sharepoint/soap/")]
public class SiteData
{
    // Same signature as the standard SiteData.GetContent; the work is delegated to SiteDataHelper.
    [WebMethod]
    public string GetContent(ObjectType objectType, string objectId, string folderUrl, string itemId, bool retrieveChildItems, bool securityOnly, ref string lastItemIdOnPage)
    {
        try
        {
            SiteDataHelper siteDataHelper = new SiteDataHelper();
            return siteDataHelper.GetContent(objectType, objectId, folderUrl, itemId, retrieveChildItems, securityOnly, ref lastItemIdOnPage);
        }
        catch (ThreadAbortException) { throw; }
        catch (Exception exception) { throw SoapServerException.HandleException(exception); }
    }

    // Same signature as the standard SiteData.GetChanges; the work is delegated to SiteDataHelper.
    [WebMethod]
    public string GetChanges(ObjectType objectType, string contentDatabaseId, ref string LastChangeId, ref string CurrentChangeId, int Timeout, out bool moreChanges)
    {
        try
        {
            SiteDataHelper siteDataHelper = new SiteDataHelper();
            return siteDataHelper.GetChanges(objectType, contentDatabaseId, ref LastChangeId, ref CurrentChangeId, Timeout, out moreChanges);
        }
        catch (ThreadAbortException) { throw; }
        catch (Exception exception) { throw SoapServerException.HandleException(exception); }
    }
}

As you see, the custom “SiteData” web service implements only the GetContent and GetChanges methods. We don’t need to implement the other methods of the standard SiteData web service because the URL rewriting redirects to the custom web service only when these two methods are invoked. The two methods in the custom service have exactly the same signatures as the ones in the standard SiteData web service. Their implementation simply delegates to the methods with the same names in a helper class, SiteDataHelper. Here is the source code of the SiteDataHelper class:

using SP = Microsoft.SharePoint.SoapServer;

namespace Stefan.SharePoint.SiteData
{
    public class SiteDataHelper
    {
        public string GetChanges(ObjectType objectType, string contentDatabaseId, ref string startChangeId, ref string endChangeId, int Timeout, out bool moreChanges)
        {
            // Call the real web service class directly (not a generated proxy).
            SP.SiteData siteData = new SP.SiteData();
            string res = siteData.GetChanges(objectType, contentDatabaseId, ref startChangeId, ref endChangeId, Timeout, out moreChanges);
            try
            {
                ListItemXmlModifier modifier = new ListItemXmlModifier(new EnvData(), res);
                res = modifier.ModifyChangesXml();
            }
            // If the modification fails, log the error and return the unmodified XML.
            catch (Exception ex) { Logging.LogError(ex); }
            return res;
        }

        public string GetContent(ObjectType objectType, string objectId, string folderUrl, string itemId, bool retrieveChildItems, bool securityOnly, ref string lastItemIdOnPage)
        {
            SPWeb web = SPContext.Current.Web;
            // Call the real web service class directly (not a generated proxy).
            SP.SiteData siteData = new SP.SiteData();
            string res = siteData.GetContent(objectType, objectId, folderUrl, itemId, retrieveChildItems, securityOnly, ref lastItemIdOnPage);
            try
            {
                // Contextual values that end up in the extra "Env.*" fields.
                EnvData envData = new EnvData() { SiteId = web.Site.ID, WebId = web.ID, ListId = objectId.TrimStart('{').TrimEnd('}') };
                if ((objectType == ObjectType.ListItem || objectType == ObjectType.Folder) && !securityOnly)
                {
                    // List item data: add the extra attributes to the rows.
                    ListItemXmlModifier modifier = new ListItemXmlModifier(envData, res);
                    res = modifier.ModifyListItemXml();
                }
                else if (objectType == ObjectType.List)
                {
                    // List schema: add the extra field definitions.
                    ListItemXmlModifier modifier = new ListItemXmlModifier(envData, res);
                    res = modifier.ModifyListXml();
                }
            }
            // If the modification fails, log the error and return the unmodified XML.
            catch (Exception ex) { Logging.LogError(ex); }
            return res;
        }
    }
}

The thing to note here is that the two methods in the SiteDataHelper class create an instance of the SiteData web service class directly (this is not a generated proxy class, but the actual web service class implemented in the standard STSSOAP.DLL). The GetContent and GetChanges methods are called on this instance and the string result is stored in a local variable. This string contains the XML with the list schema or the list item data (depending on whether the “ObjectType” parameter is “List” or “Folder”). The XML is then handed to an instance of the custom ListItemXmlModifier class, which handles all XML modifications for both the GetContent and GetChanges methods. Note that for the GetContent method the XML is passed for modification only if the “ObjectType” parameter has the “ListItem”, “Folder” or “List” value (and, for list items and folders, only when the “securityOnly” parameter is false).

I am not going to show the source code of the ListItemXmlModifier class in the posting (it is over 700 lines of code); instead I will briefly explain which changes it makes to the XML from the GetContent and GetChanges methods. The modifications are actually pretty simple and semantically there are only two types of changes, corresponding to the result XML of GetContent (ObjectType=List) and GetContent (ObjectType=Folder). The result XML of the GetChanges method has a more complex structure, but it contains the same two fragments (in one or more occurrences) wherever list and list item changes are available.

Let’s start with a sample XML from the standard SiteData.GetContent(ObjectType=List) method (I’ve trimmed some of the elements for brevity):

<List>
  <Metadata ID="{1d53a556-ae9d-4fbf-8917-46c7d97ebfa5}" LastModified="2011-01-17 13:24:18Z" Title="Pages" DefaultTitle="False" Description="This system library was created by the Publishing feature to store pages that are created in this site." BaseType="DocumentLibrary" BaseTemplate="850" DefaultViewUrl="/Pages/Forms/AllItems.aspx" DefaultViewItemUrl="/Pages/Forms/DispForm.aspx" RootFolder="Pages" Author="System Account" ItemCount="4" ReadSecurity="1" AllowAnonymousAccess="False" AnonymousViewListItems="False" AnonymousPermMask="0" CRC="699748088" NoIndex="False" ScopeID="{a1372e10-8ffb-4e21-b627-bed44a5130cd}" />
  <ACL>
    <permissions>
      <permission memberid='3' mask='9223372036854775807' />
      ....
    </permissions>
  </ACL>
  <Views>
    <View URL="Pages/Forms/AllItems.aspx" ID="{771a1809-e7f3-4c52-b346-971d77ff215a}" Title="All Documents" />
    ....
  </Views>
  <Schema>
    <Field Name="FileLeafRef" Title="Name" Type="File" />
    <Field Name="Title" Title="Title" Type="Text" />
    <Field Name="Comments" Title="Description" Type="Note" />
    <Field Name="PublishingContact" Title="Contact" Type="User" />
    <Field Name="PublishingContactEmail" Title="Contact E-Mail Address" Type="Text" />
    <Field Name="PublishingContactName" Title="Contact Name" Type="Text" />
    <Field Name="PublishingContactPicture" Title="Contact Picture" Type="URL" />
    <Field Name="PublishingPageLayout" Title="Page Layout" Type="URL" />
    <Field Name="PublishingRollupImage" Title="Rollup Image" Type="Note" TypeAsString="Image" />
    <Field Name="Audience" Title="Target Audiences" Type="Note" TypeAsString="TargetTo" />
    <Field Name="ContentType" Title="Content Type" Type="Choice" />
    <Field Name="MyLookup" Title="MyLookup" Type="Lookup" />
    ....
  </Schema>
</List>

The XML contains the metadata properties of the queried SharePoint list. The most important part is the Schema/Field elements: the simple definitions of the fields in this list. It is easy to deduce that the fields the index engine encounters in this part of the XML will be recognized and will appear as crawled properties in the search index. So what if we start adding fields of our own? This won’t do the job by itself, because we also need list items with values for these “added” fields (we’ll see that in the second XML sample), but it is the first required bit of the “hack”. The custom service implementation actually adds several extra “Field” elements like these:

    <Field Name='ContentTypeId.Text' Title='ContentTypeId' Type='Note' />
    <Field Name='Author.Text' Title='Created By' Type='Note' />
    <Field Name='Author.ID' Title='Created By' Type='Integer' />
    <Field Name='MyLookup.Text' Title='MyLookup' Type='Note' />
    <Field Name='MyLookup.ID' Title='MyLookup' Type='Integer' />
    <Field Name='PublishingRollupImage.Html' Title='Rollup Image' Type='Note' />
    <Field Name='PublishingPageImage.Html' Title='Page Image' Type='Note' />
    <Field Name='PublishingPageContent.Html' Title='Page Content' Type='Note' />
    <Field Name='Env.SiteId' Title='Env.SiteId' Type='Text' />
    <Field Name='Env.WebId' Title='Env.WebId' Type='Text' />
    <Field Name='Env.ListId' Title='Env.ListId' Type='Text' />
    <Field Name='Env.IsListItem' Title='Env.IsListItem' Type='Integer' />

You will notice immediately that these “new” fields are related to fields that already exist in the schema XML of the SharePoint list being modified. I used a specific naming convention for the “Name” attribute: the original field name followed by a dot and a short suffix. The crawled properties that the index engine generates will also contain the dot and the suffix, so it will be easy for you to locate them in the “crawled properties” page in the SSP admin site. From the “Name” attribute you can also tell right away which original field each new field is derived from. In short, the rules for creating these new fields are:

  • For every original lookup field (both single and multiple lookup columns and all derived lookup columns, e.g. user fields) two additional fields are added, with the suffixes “.ID” and “.Text” and the “Type” attribute set to “Integer” and “Note” respectively.
  • For every original publishing “HTML” and “Image” field one extra field with the “.Html” suffix is added.
  • For all lists the extra field “ContentTypeId.Text” is added with the “Type” attribute set to “Note”.
  • For all lists the additional fields “Env.SiteId”, “Env.WebId”, “Env.ListId” and “Env.IsListItem” are added.
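The rules above can be sketched in a few lines of C#. This is not the code of the ListItemXmlModifier class (which is not shown in the posting), only a minimal illustration that assumes a well-formed Schema fragment; the real SiteData output is not quite well-formed, so the actual implementation may well have to work differently. The class name SchemaFieldSketch and the sets of type names (“LookupMulti”, “UserMulti”, “HTML”) are assumptions made for the illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

// Hypothetical helper, not part of the downloadable solution.
public static class SchemaFieldSketch
{
    // Assumption: these type names identify lookup-derived and publishing markup columns.
    static readonly HashSet<string> LookupTypes = new HashSet<string> { "Lookup", "LookupMulti", "User", "UserMulti" };
    static readonly HashSet<string> MarkupTypes = new HashSet<string> { "HTML", "Image" };

    // Returns the extra Field elements for the given Schema element.
    public static List<XElement> ExtraFields(XElement schema)
    {
        var extra = new List<XElement>();
        foreach (XElement field in schema.Elements("Field"))
        {
            string name = (string)field.Attribute("Name");
            string title = (string)field.Attribute("Title");
            // In the sample schema the publishing columns have Type="Note" and carry
            // their real type in TypeAsString, so that attribute is checked first.
            string type = (string)field.Attribute("TypeAsString") ?? (string)field.Attribute("Type");

            if (LookupTypes.Contains(type))
            {
                extra.Add(NewField(name + ".Text", title, "Note"));
                extra.Add(NewField(name + ".ID", title, "Integer"));
            }
            else if (MarkupTypes.Contains(type))
            {
                extra.Add(NewField(name + ".Html", title, "Note"));
            }
        }

        // Fields that are added for every list.
        extra.Add(NewField("ContentTypeId.Text", "ContentTypeId", "Note"));
        foreach (string env in new[] { "Env.SiteId", "Env.WebId", "Env.ListId" })
            extra.Add(NewField(env, env, "Text"));
        extra.Add(NewField("Env.IsListItem", "Env.IsListItem", "Integer"));
        return extra;
    }

    static XElement NewField(string name, string title, string type)
    {
        return new XElement("Field",
            new XAttribute("Name", name),
            new XAttribute("Title", title),
            new XAttribute("Type", type));
    }
}
```

Calling ExtraFields on a schema with one “Lookup” field, one “Image” field and one plain “Text” field would yield the two lookup-derived fields, one “.Html” field and the five fields that are added for every list; the “Text” field produces nothing.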

So, we already have the extra fields in the list schema; the next step is to have them populated with the relevant values in the list item data. Let me first show you a sample of the standard unmodified XML output of the GetContent(ObjectType=Folder) method (I trimmed some of the elements and shortened the values of some of the attributes for brevity):

<Folder>
  <Metadata>
    <scope id='{5dd2834e-902d-4db0-8db2-4a1da762a620}'>
      <permissions>
        <permission memberid='1' mask='206292717568' />
        ....
      </permissions>
    </scope>
  </Metadata>
  <xml xmlns:s='uuid:BDC6E3F0-6DA3-11d1-A2A3-00AA00C14882' xmlns:dt='uuid:C2F41010-65B3-11d1-A29F-00AA00C14882' xmlns:rs='urn:schemas-microsoft-com:rowset' xmlns:z='#RowsetSchema'>
    <s:Schema id='RowsetSchema'>
      <s:ElementType name='row' content='eltOnly' rs:CommandTimeout='30'>
        <s:AttributeType name='ows_ContentTypeId' rs:name='Content Type ID' rs:number='1'>
          <s:datatype dt:type='int' dt:maxLength='512' />
        </s:AttributeType>
        <s:AttributeType name='ows__ModerationComments' rs:name='Approver Comments' rs:number='2'>
          <s:datatype dt:type='string' dt:maxLength='1073741823' />
        </s:AttributeType>
        <s:AttributeType name='ows_FileLeafRef' rs:name='Name' rs:number='3'>
          <s:datatype dt:type='string' dt:lookup='true' dt:maxLength='512' />
        </s:AttributeType>
        ....
      </s:ElementType>
    </s:Schema>
    <scopes>
    </scopes>
    <rs:data ItemCount='2'>

      <z:row ows_ContentTypeId='0x010100C568DB52D9D0A14D9B2FDCC96666E9F2007948130EC3DB064584E219954237AF390064DEA0F50FC8C147B0B6EA0636C4A7D400E595F4AC9968CC4FAD1928288BC9885A' ows_FileLeafRef='1;#Default.aspx' ows_Modified_x0020_By='myserver\sstanev' ows_File_x0020_Type='aspx' ows_Title='Home' ows_PublishingPageLayout='http://searchtest/_catalogs/masterpage/defaultlayout.aspx, Welcome page with Web Part zones' ows_ContentType='Welcome Page' ows_PublishingPageImage='' ows_PublishingPageContent='some content' ows_ID='1' ows_Created='2010-12-20T18:53:18Z' ows_Author='1;#Stefan Stanev' ows_Modified='2010-12-26T12:45:31Z' ows_Editor='1;#Stefan Stanev' ows__ModerationStatus='0' ows_FileRef='1;#Pages/Default.aspx' ows_FileDirRef='1;#Pages' ows_Last_x0020_Modified='1;#2010-12-26T12:45:32Z' ows_Created_x0020_Date='1;#2010-12-20T18:53:19Z' ows_File_x0020_Size='1;#6000' ows_FSObjType='1;#0' ows_PermMask='0x7fffffffffffffff' ows_CheckedOutUserId='1;#' ows_IsCheckedoutToLocal='1;#0' ows_UniqueId='1;#{923EEE29-44AB-4D1B-B65B-E3ECEAE1353E}' ows_ProgId='1;#' ows_ScopeId='1;#{5DD2834E-902D-4DB0-8DB2-4A1DA762A620}' ows_VirusStatus='1;#6000' ows_CheckedOutTitle='1;#' ows__CheckinComment='1;#' ows__EditMenuTableStart='Default.aspx' ows__EditMenuTableEnd='1' ows_LinkFilenameNoMenu='Default.aspx' ows_LinkFilename='Default.aspx' ows_DocIcon='aspx' ows_ServerUrl='/Pages/Default.aspx' ows_EncodedAbsUrl='http://searchtest/Pages/Default.aspx' ows_BaseName='Default' ows_FileSizeDisplay='6000' ows_MetaInfo='...' ows__Level='1' ows__IsCurrentVersion='1' ows_SelectTitle='1' ows_SelectFilename='1' ows_Edit='0' ows_owshiddenversion='26' ows__UIVersion='6656' ows__UIVersionString='13.0' ows_Order='100.000000000000' ows_GUID='{2C80A53D-4F38-4494-855D-5B52ED1D095B}' ows_WorkflowVersion='1' ows_ParentVersionString='1;#' ows_ParentLeafName='1;#' ows_Combine='0' ows_RepairDocument='0' ows_ServerRedirected='0' />

      ....
    </rs:data>
  </xml>
</Folder>

The list item data is contained below the “rs:data” element, with one “z:row” element for every list item. The attributes of the “z:row” element contain the field values of the corresponding list item. You can see that the attributes already have the “ows_” prefix, just like all crawled properties in the “SharePoint” category. Note that the attributes for lookup fields contain the unmodified item field data, but the publishing “HTML” and “Image” columns are already modified: all HTML markup has been removed from them (for “Image” columns this means that they become empty, since all the data they contain is markup).

And let’s see the additional attributes that the custom web service adds to the “z:row” elements of the list item data XML:

ows_ContentTypeId.Text='0x010100C568DB52D9D0A14D9B2FDCC96666E9F2007948130EC3DB064584E219954237AF3900242457EFB8B24247815D688C526CD44D0005E464F1BD83D14983E49C578030FBF6'
ows_PublishingPageImage.Html='&lt;img border="0" src="/PublishingImages/newsarticleimage.jpg" vspace="0" style="margin-top:8px" alt=""&gt;'
ows_PublishingPageContent.Html='&lt;b&gt;some content&lt;/b&gt;'
ows_Author.Text='1;#Stefan Stanev'
ows_Author.ID='1'
ows_Editor.Text='1;#Stefan Stanev'
ows_Editor.ID='1'
ows_Env.SiteId='ff96067a-accf-4763-8ec1-194f20fbf0f5'
ows_Env.WebId='b2099353-41d6-43a7-9b0d-ab6ad87fb180'
ows_Env.ListId='Pages'
ows_Env.IsListItem='1'

Here is how these fields are populated/formatted (as I mentioned above all the data is retrieved from the XML itself or from the context of the service request):

  • the lookup derived fields with the “.ID” and “.Text” suffixes – both get their values from the “parent” lookup column: the former is populated with the starting integer value of the lookup field, the latter with the unmodified value of the original lookup column. When the search index generates the corresponding crawled properties, the “.ID” field can be used as a true integer property, and the “.Text” one, although it contains the original lookup value, will be treated as a simple text property by the index engine (remember that in the list schema XML the type of this extra field was set to “Note”). So what is the difference between the “.Text” field and the original lookup column in the search index? The value of the original lookup column is trimmed in the search index and contains only the text value without the integer part preceding it. If you issue an SQL search query against a managed property mapped to the crawled property of a lookup field, you will be able to retrieve only the textual part of the lookup value (this also holds for filtering and sorting on this field type). With the “.Text” derived field, on the other hand, you have access to the original unmodified value of the lookup field.
  • the fields derived from the “HTML” and “Image” publishing field types with the “.Html” suffix – they are populated with the original values of the corresponding fields with the original markup intact. Since the values of the original fields in the list item data XML are already stripped, the original values are retrieved with a simple trick. The “z:row” element for every list item contains the “ows_MetaInfo” attribute, which holds a serialized property bag with all properties of the underlying SPFile object for the current list item. This property bag happens to contain all non-empty list item field values. So what I do in this case is parse the value of the ows_MetaInfo attribute and retrieve the unmodified values for all “HTML” and “Image” fields that I need. An important note here: the ows_MetaInfo attribute (and its corresponding system list field, MetaInfo) is available only for document libraries and is not present in non-library lists, which means that this trick works only for library-type lists.
  • the ows_ContentTypeId.Text field gets populated from the value of the original ows_ContentTypeId field/attribute. The difference between the two is that the derived one is defined in the schema as a “Note” field, so its value is treated by the search index as a text property.
  • the ows_Env.* fields get populated from the contextual data of the service request (see the implementation of the SiteDataHelper class). In the XML modifications for the SiteData.GetChanges method these values are retrieved from the result XML itself. The value of ows_Env.IsListItem is always set to 1 (its purpose is to serve as a flag defining a superset of the standard “isdocument” managed property).
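To make the first rule more concrete, here is a minimal C# sketch of how the “.ID” and “.Text” values could be derived from a lookup value such as “1;#Stefan Stanev”. It is an illustration only, not the code of the ListItemXmlModifier class; the class name RowValueSketch and its methods are hypothetical, and the sketch ignores details such as the “_x0020_” encoding of spaces in attribute names.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

// Hypothetical helper, not part of the downloadable solution.
public static class RowValueSketch
{
    // Returns the leading integer of a lookup value ("1;#Stefan Stanev" gives "1"),
    // or null when the value does not start with an integer.
    public static string LookupId(string lookupValue)
    {
        if (string.IsNullOrEmpty(lookupValue)) return null;
        int sep = lookupValue.IndexOf(";#", StringComparison.Ordinal);
        string head = sep >= 0 ? lookupValue.Substring(0, sep) : lookupValue;
        int id;
        return int.TryParse(head, NumberStyles.Integer, CultureInfo.InvariantCulture, out id)
            ? id.ToString(CultureInfo.InvariantCulture)
            : null;
    }

    // Builds the extra z:row attributes for one lookup field:
    // ".Text" keeps the unmodified value, ".ID" holds the starting integer.
    public static Dictionary<string, string> DerivedLookupAttributes(string fieldName, string lookupValue)
    {
        var result = new Dictionary<string, string>();
        result["ows_" + fieldName + ".Text"] = lookupValue;
        string id = LookupId(lookupValue);
        if (id != null) result["ows_" + fieldName + ".ID"] = id;
        return result;
    }
}
```

For the sample row above, DerivedLookupAttributes("Author", "1;#Stefan Stanev") would produce the two attributes ows_Author.Text and ows_Author.ID shown in the listing of added attributes. For a multiple lookup value only the first integer is taken, which matches the “starting integer value” rule.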

Installation steps for the “hack” solution

  1. Download and extract the contents of the zip archive.
  2. Build the project in Visual Studio (it is a VS 2008 project).
  3. The assembly file (Stefan.SharePoint.SiteData.dll) should be deployed to the GAC.
  4. The StefanSiteData.asmx file should be copied to the {your 12 hive root}\ISAPI folder.
  5. The global.asax file should be copied to the root folder of your SharePoint web application. Note that you will have to back up the original global.asax file before you overwrite it with this one.
  6. Open the web.config file in the target SharePoint web application and add an “appSettings” “add” element with key “UseSiteDataRewrite” and value “true”.
  7. Note that if you have more than one front-end server in your farm you should repeat steps 3-6 on all machines.
  8. After the installation is ready you can start the search crawler (full crawl) from the SSP admin site. It is a good idea to have a content source only for the web application for which the custom SiteData service is enabled, so that you can see the results of the custom service immediately.
  9. After the crawl is complete you should check the crawl log for errors – check whether there are unusual errors which were not occurring before the installation of the custom SiteData service.
  10. If there are no errors in the crawl log you can check the “crawled properties” page in the SSP admin site – the new “dotted” crawled properties should be there and you can now create new managed properties that can be mapped to them.
  11. Note that the newly created managed properties are not ready for use before you run a second full crawl for the target content source.