Getting into Higher Order Functions

Wednesday, May 18, 2016

Recently I've developed a love for F# and functional programming principles in general. Something about the way I code in those languages makes the code just seem to work the first time.

After a short break from C# I came back to it for a project at work and noticed my C# code was just working the first time too. Curious to know why, I started reviewing my recent code and found a primary reason for this: fewer intermediate variables and far greater use of LINQ's higher order functions in place of iterative loops.

To be clear, C# has had LINQ for about eight years now (since VS 2008) and I had used it often, but my default was still to start with foreach (var item in collection) and perform operations inside the loop rather than specifying which type of operation I wanted to perform.

As our use case, we'll take a look at various examples of a full-text search of a directory's files.

Shopping cart approach

This was the old standby for me. Over the years I've written this same sort of thing countless times.

This approach has the following form:

1. Create a result variable
2. Iterate through the list, adding some items to the result
3. Return the result

public IEnumerable<FileInfo> Search(DirectoryInfo directory, string term)
{
    var result = new List<FileInfo>();
    var files = directory.GetFiles("*.txt");
    foreach (var file in files)
    {
        var text = File.ReadAllText(file.FullName);
        if (text.Contains(term)) result.Add(file);
    }
    return result;
}

This approach has six variables in scope (the two arguments plus four locals) and forces us to read the entire method to realize we're just taking a subset of the FileInfo objects returned from the call to .GetFiles().

Using Yield

C# 2.0 introduced the yield keyword, which makes this code a bit cleaner.

public IEnumerable<FileInfo> Search(DirectoryInfo directory, string term)
{
    var files = directory.GetFiles("*.txt");
    foreach (var file in files)
    {
        var text = File.ReadAllText(file.FullName);
        if (text.Contains(term)) yield return file;
    }
}

This approach reduces our variable count because we no longer have to declare a result variable. However, we still have to read the entire method to discover that we're just returning a subset of the items in the array.

The method's name could be changed to be more descriptive, but that's really not much better than a comment because the compiler doesn't care what the method is named or whether that name matches its actual implementation.

Returning a subset of elements in an array

That's really what the method does, right? Luckily this is a common requirement and something many languages support. Most languages call this filter, but LINQ was first marketed as a way to query data, so the decision was made to mirror the SQL keyword Where.

public IEnumerable<FileInfo> Search(DirectoryInfo directory, string term)
{
    return directory
        .GetFiles("*.txt")
        .Where(file => File.ReadAllText(file.FullName).Contains(term));
}

With this approach there are only the two arguments and the lambda to deal with. Since we know that .Where() returns a subset of the elements in a sequence, a cursory glance at this method is all we need to determine what it does.
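As a quick usage sketch (the directory path and search term here are just illustrative, not from the original post):

var directory = new DirectoryInfo(@"C:\notes");      // example path
foreach (var file in Search(directory, "invoice"))   // assumes Search is in scope on the same class
    Console.WriteLine(file.FullName);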

Returning just the file names

So basically we want to take a collection of one type and return a collection of a different type (FileInfo to string). Staying consistent with SQL terminology, LINQ uses .Select(), while most languages call this map.

public IEnumerable<string> GetFileNames(DirectoryInfo directory)
{
    return directory
        .GetFiles("*.txt")
        .Select(file => file.Name);
}

In this example we're returning the names of every .txt file in the directory.

If we wanted to return the file names for all matching files we could easily chain calls like so.

public IEnumerable<string> GetMatchingFileNames(DirectoryInfo directory, string term)
{
    return directory
        .GetFiles("*.txt")
        .Where(file => File.ReadAllText(file.FullName).Contains(term))
        .Select(file => file.Name);
}

To me this code reads very cleanly and is code I would want to maintain. Note that while it may look like the code iterates through the list twice, C# actually streams the results so that the collection is only enumerated once.
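Here's a minimal sketch of that streaming behavior, using a logging side effect to show when each file is actually read (the method name and Console output are just for illustration):

public IEnumerable<string> GetMatchingFileNamesVerbose(DirectoryInfo directory, string term)
{
    return directory
        .GetFiles("*.txt")
        .Where(file =>
        {
            // Runs once per file, and only when the result is enumerated
            Console.WriteLine("Reading " + file.Name);
            return File.ReadAllText(file.FullName).Contains(term);
        })
        .Select(file => file.Name);
}

Calling the method prints nothing by itself; each file is read exactly once as a single foreach pulls items through .Where() and then .Select().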

Returning the total file size of all matching files

All of these examples have started and ended with a generic collection. We may have changed the element type of the collection (.Select, map), but we still returned a collection. This example takes a collection and returns a single long. Said in a generic way, we're taking an IEnumerable<T> and returning some other type T2.

public long GetMatchingFileSize(DirectoryInfo directory, string term)
{
    return directory
        .GetFiles("*.txt")
        .Where(file => File.ReadAllText(file.FullName).Contains(term))
        .Select(file => file.Length)
        .Aggregate((previous, current) => previous + current);
}

In this case the example is a bit contrived since we could use the .Sum helper method, but the important part is to understand the concept of .Aggregate, or reduce.
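For comparison, here is a sketch of the .Sum version the post alludes to (my own rewrite, not code from the original):

public long GetMatchingFileSize(DirectoryInfo directory, string term)
{
    return directory
        .GetFiles("*.txt")
        .Where(file => File.ReadAllText(file.FullName).Contains(term))
        .Sum(file => file.Length); // equivalent to the Aggregate version for non-empty input
}

One small difference worth knowing: .Aggregate without a seed throws on an empty sequence, while .Sum simply returns 0.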
