XML Data Provider

Optimize LINQ queries on XML Data Provider

To start with, let's have a look at this query:

public static string GetPageTitle(Guid pageId)
{
	using (DataConnection connection = new DataConnection())
	{
		IQueryable<IPage> pages = connection.Get<IPage>();

		return (from page in pages
			where page.Id == pageId
			select page.Title).FirstOrDefault();
	}
}

The XML data provider works in a rather simple way. The first time this code is executed, it reads the "IPage" collection from the related XML file (/App_Data/Composite/DataStores/*.xml); after that, all queries are executed against an in-memory collection of those pages (the System.Linq.EnumerableQuery<T> class).
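To see what that in-memory wrapping means in practice, here is a minimal, self-contained sketch (it does not use the CMS API at all, only standard .NET types): wrapping a List<T> with AsQueryable() produces the same EnumerableQuery<T> behavior, where every query is first turned into an expression tree and compiled before it can filter the list.

using System;
using System.Collections.Generic;
using System.Linq;

public class EnumerableQueryDemo
{
    public static void Main()
    {
        List<string> titles = new List<string> { "Home", "News", "Contact" };

        IQueryable<string> queryable = titles.AsQueryable();

        // Prints "System.Linq.EnumerableQuery`1[System.String]"
        Console.WriteLine(queryable.GetType());

        // Where/Select below build an expression tree that has to be compiled
        // to delegates before it can run against the in-memory list.
        string result = queryable.Where(t => t.StartsWith("N")).FirstOrDefault();
        Console.WriteLine(result); // "News"
    }
}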

The execution time of our GetPageTitle() method can be estimated as:

  1. Getting pages from the XML file: <5ms (and only for the first call)
  2. Compiling LINQ: 7ms
  3. Executing a query: <1ms

So compiling the LINQ query accounts for about 7ms of the execution time, even if there is only one page in the system. To make the execution time shorter, we can use the IEnumerable<> interface instead of IQueryable<>:

public static string GetPageTitle(Guid pageId)
{
	using (DataConnection connection = new DataConnection())
	{
		IEnumerable<IPage> pages = connection.Get<IPage>();

		return (from page in pages
			where page.Id == pageId
			select page.Title).FirstOrDefault();
	}
}
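If you want to verify the difference on your own installation, a rough measurement could look like the sketch below. It is only an illustration: it assumes it is placed in the same class as the two methods above, with the IEnumerable<> variant renamed to GetPageTitle2 so both can exist side by side, and the actual numbers will vary per machine.

using System;
using System.Diagnostics;

public static void MeasurePageTitleLookups(Guid pageId, int iterations)
{
    // Warm up once, so the one-time cost of loading the XML file
    // does not skew either measurement.
    GetPageTitle(pageId);
    GetPageTitle2(pageId);

    Stopwatch queryableTimer = Stopwatch.StartNew();
    for (int i = 0; i < iterations; i++)
    {
        GetPageTitle(pageId);      // IQueryable<> version: compiles an expression tree on each call
    }
    queryableTimer.Stop();

    Stopwatch enumerableTimer = Stopwatch.StartNew();
    for (int i = 0; i < iterations; i++)
    {
        GetPageTitle2(pageId);     // IEnumerable<> version: plain delegate calls
    }
    enumerableTimer.Stop();

    Console.WriteLine("IQueryable<>:  {0}ms", queryableTimer.ElapsedMilliseconds);
    Console.WriteLine("IEnumerable<>: {0}ms", enumerableTimer.ElapsedMilliseconds);
}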

The new code looks pretty much the same, but if you look at the generated code through a .NET decompiler (for example, Reflector or JustDecompile), you will see that in the first case we are building a small expression tree, which requires compilation. In the second case, there are just method calls over an IEnumerable<IPage> object, which execute rather quickly.

// Decompiled output, cleaned up for readability. The IQueryable<> version builds
// an expression tree for the "where" and "select" clauses, which has to be
// compiled before the query can run:
public static string GetPageTitle(Guid pageId)
{
    PageHandler pageHandler = new PageHandler();   // compiler-generated closure class
    pageHandler.pageId = pageId;
    using (DataConnection connection = new DataConnection())
    {
        IQueryable<IPage> pages = connection.Get<IPage>();
        ParameterExpression page = Expression.Parameter(typeof(IPage), "page");
        return pages
            .Where(Expression.Lambda<Func<IPage, bool>>(
                Expression.Equal(
                    Expression.Property(page, "Id"),
                    Expression.Field(Expression.Constant(pageHandler), "pageId")),
                page))
            .Select(Expression.Lambda<Func<IPage, string>>(
                Expression.Property(page, "Title"),
                page))
            .FirstOrDefault();
    }
}

// The IEnumerable<> version only creates delegates and invokes them directly -
// no expression tree and no compilation step:
public static string GetPageTitle2(Guid pageId)
{
    PageHandler pageHandler = new PageHandler();   // compiler-generated closure class
    pageHandler.pageId = pageId;
    using (DataConnection connection = new DataConnection())
    {
        IEnumerable<IPage> pages = connection.Get<IPage>();
        return pages
            .Where(new Func<IPage, bool>(pageHandler.<GetPageTitle2>b__2)) // "page => page.Id == pageId"
            .Select(new Func<IPage, string>(page => page.Title))           // cached, non-capturing delegate
            .FirstOrDefault();
    }
}

Let's sum up what we have learned so far.

TIP 1: It's better to use the IEnumerable<> interface instead of IQueryable<> when working with data that is kept in memory, for example a List<T> or data retrieved from the XML data provider.
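The same tip can be illustrated with an ordinary in-memory collection, without the CMS API. In the sketch below the only difference between the two queries is the static type of the source variable, yet the first one goes through Queryable and expression-tree compilation, while the second one calls the Enumerable methods directly:

using System;
using System.Collections.Generic;
using System.Linq;

public class Tip1Demo
{
    public static void Main()
    {
        List<int> numbers = Enumerable.Range(0, 1000).ToList();

        // IQueryable<> over an in-memory list: builds and compiles an
        // expression tree every time the query is executed.
        IQueryable<int> asQueryable = numbers.AsQueryable();
        int viaQueryable = asQueryable.Where(n => n % 7 == 0).Count();

        // IEnumerable<>: the lambda is just a delegate, invoked directly.
        IEnumerable<int> asEnumerable = numbers;
        int viaEnumerable = asEnumerable.Where(n => n % 7 == 0).Count();

        Console.WriteLine("{0} == {1}", viaQueryable, viaEnumerable);
    }
}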

The other thing that is nice to know when working with EnumerableQuery<> (the XML data provider) is that it is very easy to end up with a bad execution complexity.

If you are not familiar with “Analysis of algorithms”, here is a small example:

using (DataConnection connection = new DataConnection())
{
    var table =
        (from product in connection.Get<IProduct>()
         from productCategory in connection.Get<IProductCategory>()
         where product.Id == productCategory.ProductId
         select new { product.Id, product.Name, productCategoryName = productCategory.Name }).ToList();
}

Although this query looks correct (and it is, for the SQL data provider), it has O(n*n) complexity when using the XML data provider. If you have, for example, 10,000 products and a comparable number of product-category records, it takes about 100,000,000 operations to compare product.Id and productCategory.ProductId. The query may take minutes to execute.

For more information on O(n*n), please read "Analysis of algorithms".
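Here is a minimal, CMS-independent demonstration of where the 100,000,000 figure comes from, assuming 10,000 records on each side of the comparison (the names and numbers are made up for illustration; the run itself takes a noticeable amount of time):

using System;
using System.Linq;

public class CrossJoinCostDemo
{
    private static long _comparisons;

    private static bool KeysMatch(int left, int right)
    {
        _comparisons++;
        return left == right;
    }

    public static void Main()
    {
        int n = 10000;
        var products = Enumerable.Range(0, n).ToList();
        var categories = Enumerable.Range(0, n).ToList();

        // Same shape as the query above: a nested "from ... from ... where"
        // compiles to SelectMany, so the inner list is scanned once per outer item.
        var pairs = (from product in products
                     from category in categories
                     where KeysMatch(product, category)
                     select new { product, category }).ToList();

        Console.WriteLine("Matched pairs: {0}", pairs.Count);   // 10,000
        Console.WriteLine("Comparisons:   {0}", _comparisons);  // 100,000,000
    }
}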

In this example, the solution is to use a hash table. The optimized code looks like this:

using (DataConnection connection = new DataConnection())
{
    // Build a hash table that maps a product ID to its category record once,
    // so the lookup inside the query below is close to constant time.
    var categoryByProductId = new Composite.Core.Collections.Generic.Hashtable<Guid, IProductCategory>();
    foreach (var productCategory in connection.Get<IProductCategory>())
    {
        categoryByProductId.Add(productCategory.ProductId, productCategory);
    }

    var table = (from product in connection.Get<IProduct>()
                 select new
                 {
                     product.Id,
                     product.Name,
                     productCategoryName = categoryByProductId[product.Id].Name
                 }).ToList();
}

This code has roughly O(n) complexity, because each hash table lookup takes close to constant time. For 10,000 products that is on the order of tens of thousands of operations instead of 100,000,000, and the query is likely to take less than 10ms to execute.
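If you prefer standard .NET collections, the same optimization can be sketched with Dictionary<TKey, TValue> and ToDictionary(); as above, this assumes that IProductCategory.ProductId is the field that links a category record to its product:

using (DataConnection connection = new DataConnection())
{
    // Dictionary<TKey, TValue> gives the same near-constant-time lookups.
    Dictionary<Guid, IProductCategory> categoryByProductId =
        connection.Get<IProductCategory>().ToDictionary(category => category.ProductId);

    var table = (from product in connection.Get<IProduct>()
                 select new
                 {
                     product.Id,
                     product.Name,
                     productCategoryName = categoryByProductId[product.Id].Name
                 }).ToList();
}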

TIP 2: When using the XML data provider, it's better to avoid joining tables in the same LINQ query. If you still have to, make sure the joined collections will never hold more than a few hundred records, so the query does not become a bottleneck that can load the server completely.
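If a join really is required, one option worth knowing about: when the sources are treated as IEnumerable<>, the LINQ join clause maps to Enumerable.Join, which builds a hash-based lookup over the inner sequence, so it does not degrade into an n*n comparison loop. A sketch, using the same product and category types as above:

using (DataConnection connection = new DataConnection())
{
    IEnumerable<IProduct> products = connection.Get<IProduct>();
    IEnumerable<IProductCategory> categories = connection.Get<IProductCategory>();

    // Enumerable.Join builds a lookup over the inner sequence once,
    // instead of comparing every product against every category record.
    var table = (from product in products
                 join productCategory in categories
                     on product.Id equals productCategory.ProductId
                 select new
                 {
                     product.Id,
                     product.Name,
                     productCategoryName = productCategory.Name
                 }).ToList();
}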