Using the Sitecore open source AdvancedDatabaseCrawler Lucene indexer

From Sitecore 6.5, Sitecore is deprecating the old Lucene search method. This simply means that you can no longer use the old built-in Lucene search, but have to use the new built-in Lucene search instead.

This is article 1 of 3:

Part 1 – Configuring the index: Using the Sitecore open source AdvancedDatabaseCrawler Lucene indexer
Part 2 – Simple search: Get latest news using Sitecore AdvancedDatabaseCrawler Lucene index
Part 3 – Multivalue search: Get items based on Metadata using Sitecore AdvancedDatabaseCrawler Lucene index

There are many good reasons to use the new way of indexing, but it also requires you to redo some of your previous work. The old way of indexing was easy to set up and easy to use. The new way is more complex because it lets you do more advanced stuff.

That’s why Alex Shyba wrote an open source module that makes the searching and indexing easier: The Advanced Database Crawler. You can find the module here:

http://trac.sitecore.net/AdvancedDatabaseCrawler/browser/Branches/v2/

I have always felt that open source modules are the best way of inheriting other developers' unsolvable bugs. But I have tried this module, and it works.

I will now show you how to set up the module, and in 2 later posts, I will show you how to perform the 2 basic tasks that you do with an index: How to get the latest news, and how to get all items with a certain set of metadata categories.

The new indexing applies to newer versions of Sitecore. Sitecore 6.2 revision 5 should do it, but from 6.4 and forward you are certain that it will work.

First you need to compile the AdvancedDatabaseCrawler. Once compiled, you get these DLLs:

  • Sitecore.SharedSource.SearchCrawler.dll
  • Sitecore.SharedSource.SearchCrawler.DynamicFields.dll
  • Sitecore.SharedSource.SearchDemo.dll
  • Sitecore.SharedSource.Searcher.dll

The Sitecore.SharedSource.SearchDemo.dll is not needed in production, but it implements some test pages (found in /sitecore modules/Web/searchdemo) that you can use to test your index. Copy the DLLs to your Sitecore /bin/ folder and you are ready to go. If you want to test your index with Alex’s samples, also copy the /sitecore modules/Web/searchdemo files.

Now you need to set up an index. I will show you how to set up an index that crawls the WEB database, as I’m going to use the index for frontend indexing.

Create a new .config file (the file name is up to you) and put it in the /App_Config/Include folder. Add the following:

<configuration xmlns:x="http://www.sitecore.net/xmlconfig/">
  <sitecore>
    <databases>
      <database id="web" singleInstance="true" type="Sitecore.Data.Database, Sitecore.Kernel">
        <Engines.HistoryEngine.Storage>
          <obj type="Sitecore.Data.$(database).$(database)HistoryStorage, Sitecore.Kernel">
            <param connectionStringName="$(id)" />
            <EntryLifeTime>30.00:00:00</EntryLifeTime>
          </obj>
        </Engines.HistoryEngine.Storage>
        <Engines.HistoryEngine.SaveDotNetCallStack>false</Engines.HistoryEngine.SaveDotNetCallStack>
      </database>
    </databases>
    <search>
      <configuration>
        <indexes>
          <index id="web" type="Sitecore.Search.Index, Sitecore.Kernel">
            <param desc="name">$(id)</param>
            <param desc="folder">web</param>
            <Analyzer ref="search/analyzer" />
            <locations hint="list:AddCrawler">
              <master type="Sitecore.SharedSource.SearchCrawler.Crawlers.AdvancedDatabaseCrawler,Sitecore.SharedSource.SearchCrawler">
                <Database>web</Database>
                <Root>/sitecore/content</Root>
                <IndexAllFields>true</IndexAllFields>
                <fieldCrawlers hint="raw:AddFieldCrawlers">
                  <fieldCrawler type="Sitecore.SharedSource.SearchCrawler.FieldCrawlers.LookupFieldCrawler,Sitecore.SharedSource.SearchCrawler" fieldType="Droplink" />
                  <fieldCrawler type="Sitecore.SharedSource.SearchCrawler.FieldCrawlers.DateFieldCrawler,Sitecore.SharedSource.SearchCrawler" fieldType="Datetime" />
                  <fieldCrawler type="Sitecore.SharedSource.SearchCrawler.FieldCrawlers.DateFieldCrawler,Sitecore.SharedSource.SearchCrawler" fieldType="Date" />
                  <fieldCrawler type="Sitecore.SharedSource.SearchCrawler.FieldCrawlers.NumberFieldCrawler,Sitecore.SharedSource.SearchCrawler" fieldType="Number" />
                </fieldCrawlers>
                <!-- If a field type is not defined, defaults of storageType="NO", indexType="UN_TOKENIZED" vectorType="NO" boost="1f" are applied-->
                <fieldTypes hint="raw:AddFieldTypes">
                  <!-- Text fields need to be tokenized -->
                  <fieldType name="single-line text" storageType="NO" indexType="TOKENIZED" vectorType="NO" boost="1f" />
                  <fieldType name="multi-line text" storageType="NO" indexType="TOKENIZED" vectorType="NO" boost="1f" />
                  <fieldType name="word document" storageType="NO" indexType="TOKENIZED" vectorType="NO" boost="1f" />
                  <fieldType name="html" storageType="NO" indexType="TOKENIZED" vectorType="NO" boost="1f" />
                  <fieldType name="rich text" storageType="NO" indexType="TOKENIZED" vectorType="NO" boost="1f" />
                  <fieldType name="memo" storageType="NO" indexType="TOKENIZED" vectorType="NO" boost="1f" />
                  <fieldType name="text" storageType="NO" indexType="TOKENIZED" vectorType="NO" boost="1f" />
                  <!-- Multilist based fields need to be tokenized to support search of multiple values -->
                  <fieldType name="multilist" storageType="NO" indexType="TOKENIZED" vectorType="NO" boost="1f" />
                  <fieldType name="treelist" storageType="NO" indexType="TOKENIZED" vectorType="NO" boost="1f" />
                  <fieldType name="treelistex" storageType="NO" indexType="TOKENIZED" vectorType="NO" boost="1f" />
                  <fieldType name="checklist" storageType="NO" indexType="TOKENIZED" vectorType="NO" boost="1f" />
                  <!-- Legacy tree list field from ver. 5.3 -->
                  <fieldType name="tree list" storageType="NO" indexType="TOKENIZED" vectorType="NO" boost="1f" />
                </fieldTypes>
              </master>
            </locations>
          </index>
        </indexes>
      </configuration>
    </search>
  </sitecore>
</configuration>

A short explanation:

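The /sitecore/databases/database element enables a HistoryEngine on the WEB database. The index depends on it: no HistoryEngine, no index. The EntryLifeTime value 30.00:00:00 is a time span in days.hours:minutes:seconds format, in this case 30 days.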

The /sitecore/search/configuration/indexes/index is the actual index. This is taken straight from Alex Shyba’s own examples and defines an index called “web” that contains everything (all items, all fields) from the WEB database.
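The same index definition can be narrowed to a part of the content tree by changing the Root element. The following is only a sketch, not something tested for this post: the index id "news" and the path /sitecore/content/Home/News are hypothetical names chosen for illustration, and only elements already used in the configuration above appear. The fieldCrawlers and fieldTypes sections are left out for brevity and should be copied from the full configuration, and the HistoryEngine on the WEB database is still required.

```xml
<configuration xmlns:x="http://www.sitecore.net/xmlconfig/">
  <sitecore>
    <search>
      <configuration>
        <indexes>
          <!-- Hypothetical index: the id and folder names are illustrative only -->
          <index id="news" type="Sitecore.Search.Index, Sitecore.Kernel">
            <param desc="name">$(id)</param>
            <param desc="folder">news</param>
            <Analyzer ref="search/analyzer" />
            <locations hint="list:AddCrawler">
              <master type="Sitecore.SharedSource.SearchCrawler.Crawlers.AdvancedDatabaseCrawler,Sitecore.SharedSource.SearchCrawler">
                <Database>web</Database>
                <!-- Assumed path: replace with the root of the items you want indexed -->
                <Root>/sitecore/content/Home/News</Root>
                <IndexAllFields>true</IndexAllFields>
                <!-- fieldCrawlers and fieldTypes omitted; copy them from the full configuration -->
              </master>
            </locations>
          </index>
        </indexes>
      </configuration>
    </search>
  </sitecore>
</configuration>
```

A smaller root means fewer items to crawl, which may matter if you only ever search one section of the site.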

Read more about setting up indexes here.

That is it. You cannot use Sitecore to rebuild the index anymore. You need to either use the /sitecore modules/Web/searchdemo/RebuildDatabaseCrawlers.aspx page or write your own simple code:

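// Requires: using System; using Sitecore.Jobs; using Sitecore.Search;
// Runs Rebuild() on the "web" index as a background job (the job needs the index object, not its name).
JobOptions options = new JobOptions("RebuildSearchIndex", "index", Sitecore.Client.Site.Name, SearchManager.GetIndex("web"), "Rebuild");
options.AfterLife = TimeSpan.FromMinutes(1.0);
Job job = JobManager.Start(options);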

In the following posts I will demonstrate how to get the latest news, and how to get all items with a certain set of metadata categories.

About briancaos

Developer at Pentia A/S since 2003. Have developed Web Applications using Sitecore Since Sitecore 4.1.

21 Responses to Using the Sitecore open source AdvancedDatabaseCrawler Lucene indexer

  1. You can also rebuild the index using the IndexViewer :)

    Cheers
    Jens

  2. Pingback: Get latest news using Sitecore AdvancedDatabaseCrawler Lucene index « Brian Pedersen’s Sitecore and .NET Blog

  3. Pingback: Get items based on Metadata using Sitecore AdvancedDatabaseCrawler Lucene index « Brian Pedersen’s Sitecore and .NET Blog

  4. Pingback: Sitecore poor database performance « Brian Pedersen’s Sitecore and .NET Blog

  5. If you want to index properties of an item, like ID, TemplateId etc., see: indexing Sitecore item properties in Lucene

    Cheers

    Mortaza

  6. You may also be interested in: How to index PDF content with Lucene AdvancedDatabaseCrawler in Sitecore

    Old Buddy

  7. It's a great post. I did the same thing that you have posted here, but I got some errors and I am not able to build indexes. It shows me an error like “constructor is not there”. I was able to build indexes when I was using master type="Sitecore.Search.Index", but with master type="Sitecore.SharedSource.SearchCrawler.Crawlers.AdvancedDatabaseCrawler" I am getting the error. Do you provide any package for this? Thanks

  8. Hi, can you please help me out with numeric ranges? I am not able to search any price range. What do I have to take care of during a range search? To search on price I have used the NumericRangeDemoPage.aspx page from AdvancedDatabaseCrawler, but I am not getting any results. Please give some suggestions. Please reply soon. Thanks

  9. natash ajaz, were you able to resolve the “constructor is not there” problem?

    Both the Numeric Range and Date Range classes in AdvancedDatabaseCrawler use RangeQuery, which is a built-in Lucene.net class.

    The problem might not be the query that you are building. It could be that the values are not getting indexed properly.

    Please use index viewer http://marketplace.sitecore.net/en/Modules/Index_Viewer.aspx to evaluate indexed data

    Mortaza

  10. Hi, yes, I have resolved my problems. With NumericRange search, AdvancedDatabaseCrawler does some manipulation of the provided number, and that's why I was not able to find any search results. Now it's working for me. Thank you.

  11. Hi, numeric range with AdvancedDatabaseCrawler is working fine, but I found that it sometimes shows irrelevant results. For example, if my price range is 1 to 2, the crawler finds prices containing the numbers 1 and 2, but I want only the products within the 1 to 2 range. Do you have any suggestions for this type of implementation, so that I get proper results within the defined range?

  12. Can anyone help us with the numeric range issue? It's urgent.

  13. I really don’t understand your question…

    The problem with AdvancedDatabaseCrawler is that it stores everything in a field called _content. In fact, it creates multiple fields with the same name, so when you search for something, it also finds it everywhere else.

    You need to fix this in the protected override void AddAllFields method in AdvancedDatabaseCrawler: add your own Lucene field and ask for that field when you search.

    I am not happy with AdvancedDatabaseCrawler, and in fact I am not even happy with the Sitecore class DatabaseCrawler that AdvancedDatabaseCrawler inherits from.

    I have disassembled the code for both AdvancedDatabaseCrawler and the Sitecore class DatabaseCrawler to build tons of workarounds…

  14. Actually, I want to search a numeric range using Lucene. If I provide a price range like $1 to $5, it should give me the products that are in the defined price range. Lucene is able to search for a range, but I found that it shows me irrelevant output.
    In my scenario, if I search for products in the price range 1 to 5, Lucene shows me a product with a price like 11.30. This may be because the Lucene search string looks like “price:[1 to 5]”, which would actually match a record with the value “11.30”: the first character in “11.30” is ‘1’, and that is between ‘1’ and ‘5’, so the whole string “11.30” is between “1” and “5”.
    I want Lucene to find products with a price between 1 and 5, not 11.30.

    To search the range I have used a range query as follows:

    public static string FormatNumber(int number)
    {
        return FormatNumber((double)(number));
    }

    protected void AddNumericRangeQuery(BooleanQuery query, NumericRangeField range, BooleanClause.Occur occurance)
    {
        Term lowerTerm = new Term(range.FieldName, FormatNumber(range.Start));
        Term upperTerm = new Term(range.FieldName, FormatNumber(range.End));
        RangeQuery query1 = new RangeQuery(lowerTerm, upperTerm, true);
        query.Add(query1, occurance);
    }

    Can you please help me out with the numeric range? Please reply as soon as possible. Thank you. It's urgent.

  15. I really don’t understand why you keep repeating the same thing over and over…
    I already told you what the problem is and how to resolve it; please read my previous post.
    Create your own Lucene field, call it “price”, and then ask for that field during the search.
    As I told you before, the problem is not your query. Everything is stored under multiple fields with the same name, “_content”, so the query probably finds 1 2 3 4 5 somewhere else, and that’s why it returns those products.

    Your statement is not true: “It looks like the first character in “11.30” is ‘1’, and that’s between the ‘1’ and ‘5’, so the whole string “11.30” is between “1” and “5”.” Lucene does not work like that.

  16. Hi,
    Have you thoroughly gone through my problem? I know where the problem is and what it is, but I wanted to know the approach to solve it. Anyway, I am now able to solve my problems. Thank you very much. And obviously, when I use a range query I have to pass the field name; the suggestions that you gave, I had already done. Thank you.

  17. Goutam says:

    Hi,
    Will this work with sitecore 6.6, .NET 4 and MVC 4? I downloaded the source code from [http://trac.sitecore.net/AdvancedDatabaseCrawler/browser/Branches/v2/] but when I open the source code the namespace is scSearchContrib.Searcher, scSearchContrib.Searcher.Parameters etc. In the demo video and your comment above the namespace is Sitecore.SharedSource.Searcher, basically the namespace starts with Sitecore.SharedSource in the video.
    Is there a different download location?
    Please advise

    Thank you

  18. briancaos says:

    The source code has been moved to the Sitecore Marketplace, http://marketplace.sitecore.net/en/Modules/Search_Contrib.aspx where you can find a link to GitHub https://github.com/sitecorian/SitecoreSearchContrib from where you can download the latest version. You should contact the contributor Alex Shyba (http://profile.sitecore.net/Profile.aspx?userId=PepQhVftadcxkB7%2f1N9H2Am-Q1NYkAQxrzweORlBhzY%3d) for more information on its compatibility with Sitecore 6.6.

    The code becomes obsolete with Sitecore 7, where the indexing has been improved radically, see more here: https://briancaos.wordpress.com/2013/04/10/sitecore-7-is-comming-my-wish-list-for-sitecore-7-01/

  19. Please forgive my ignorance, but what do the EntryLifeTime tag and its 30.00:00:00 value do exactly?
    Thank you for any and all help!

  20. Arjunan says:

    Hi
    I am getting the following issue when I try to rebuild the index. Can you please help me resolve it?

    I have copied the necessary DLLs and added the config entry. What could be the reason for this issue?

    Could not resolve type name: Sitecore.SharedSource.SearchCrawler.Crawlers.AdvancedDatabaseCrawler,Sitecore.SharedSource.SearchCrawler (method: Sitecore.Configuration.Factory.CreateFromTypeName(XmlNode configNode, String[] parameters, Boolean assert)).

  21. Pingback: Adding “or” to the Template Filter of Advanced Database Crawler – Core Competency
