A couple of days ago, while working on a C# program, I had the necessity of grouping contiguous elements from a sequence given a property. A group needs to be created each time the value of the property changes in a similar way to the
uniq Unix utility.
Python has a function called
groupby which is part of the nice
itertools module which does exactly what I want. For example:
>>> strList = ["abc","ert","bre","sd","ghj","awe","ew","gh"]
>>> [list(g) for i,g in groupby(strList,lambda x: len(x))]
[['abc', 'ert', 'bre'], ['sd'], ['ghj', 'awe'], ['ew', 'gh']]
>>> numList = [1,2,2,2,1,1,1,5,5]
>>> [list(g) for i,g in groupby(numList)]
[[1], [2, 2, 2], [1, 1, 1], [5, 5]]
In .NET a class called
Enumerable with lots of extension methods to manipulate with sequences (
IEnumerable<T>). This class includes an extension method called
GroupBy which groups values according to a key. However it behaves more like
SQL's Group by in that it considers all the values of the collection. For example:
csharp> using System.Collections.Generic;
csharp> var strList = new List<string>() {"abc","ert","bre","sd","ghj","awe","ew","gh"};
csharp> strList.GroupBy(x => x.Length);
{ { "abc", "ert", "bre", "ghj", "awe" }, { "sd", "ew", "gh" } }
csharp> var numList = new List<int>() {1,2,2,2,1,1,1,5,5};
csharp> numList.GroupBy(x => x);
{ { 1, 1, 1, 1 }, { 2, 2, 2 }, { 5, 5 } }
(The C# examples will be presented using the very useful
Mono C# REPL)
Writing the equivalent function in C# turned out to be a nice programming excessive. Also trying to write the function using only "
yield return" turned out the more challenging than I thought!.
As a first try we can write this function using intermediate collections to store the partial groups:
using System.Collections.Generic;
using System;
namespace Langexplr.Experiments
{
public class MyTuple<T,K>
{
public T Item1 { get; set; }
public K Item2 { get; set; }
public static MyTuple<T1,K1> Create<T1,K1>(T1 first, K1 second)
{
return new MyTuple<T1,K1>() {Item1 = first, Item2 = second};
}
}
public static class GroupByTests
{
public static IEnumerable<MyTuple<K,IList<T>>> MyGroupByWithLists<T,K>(this IEnumerable<T> en,Func<T,K> keyExtraction)
{
K currentKey = default(K);
bool firstTime = true;
IList<T> currentGroup = new List<T>();
foreach(var aValue in en)
{
if (firstTime)
{
currentKey = keyExtraction(aValue);
firstTime = false;
}
else
{
K tmpKey = keyExtraction(aValue);
if (!tmpKey.Equals(currentKey))
{
yield return MyTuple<K,IList<T>>.Create(currentKey, currentGroup);
currentGroup = new List<T>();
currentKey = tmpKey;
}
}
currentGroup.Add(aValue);
}
if (currentGroup.Count > 0)
{
yield return MyTuple<K,IList<T>>.Create(currentKey, currentGroup);
}
}
}
}
Here I'm defining a class called
MyTuple
which is very similar to .NET 4' Tuple class to store group's key and members.
This function works as expected, for example:
csharp> strList.MyGroupByWithLists(x => x.Length).Select(x => x.Item2).ToList();
{ { "abc", "ert", "bre" }, { "sd" }, { "ghj", "awe" }, { "ew", "gh" } }
csharp> numList.MyGroupByWithLists(x => x).Select(x => x.Item2).ToList();
{ { 1 }, { 2, 2, 2 }, { 1, 1, 1 }, { 5, 5 } }
One of the interesting things about the Python version of
groupby
is that it doesn't create an intermediate collections for each group. The itertools module reference has the code for the groupby
implementation.
Trying to write a this function with similar characteristics in C# resulted in the following (scary) code:
public static IEnumerable<MyTuple<K,IEnumerable<T>>> MyGroupBy<T,K>(this IEnumerable<T> en,Func<T,K> keyExtraction)
{
K currentGroupKey = default(K);
bool firstTime = true;
bool hasMoreElements = false;
bool yieldNewValue = false;
IEnumerator<T> enumerator = en.GetEnumerator();
hasMoreElements = enumerator.MoveNext();
while (hasMoreElements)
{
if (firstTime)
{
firstTime = false;
yieldNewValue = true;
currentGroupKey = keyExtraction(enumerator.Current);
}
else
{
K lastKey;
while((lastKey = keyExtraction(enumerator.Current)).Equals( currentGroupKey) &&
(hasMoreElements = enumerator.MoveNext()))
{
}
if(hasMoreElements &&
!lastKey.Equals(currentGroupKey))
{
currentGroupKey = lastKey;
yieldNewValue = true;
}
else
{
yieldNewValue = false;
}
}
if (yieldNewValue) {
yield return MyTuple<K,IEnumerable<T>>.Create(
currentGroupKey,
ReturnSubSequence((x) => {
hasMoreElements = enumerator.MoveNext();
return hasMoreElements &&
x.Equals(keyExtraction(enumerator.Current)); },
enumerator,
currentGroupKey,
enumerator.Current));
}
}
}
static IEnumerable<T> ReturnSubSequence<T,K>(Predicate<K> pred, IEnumerator<T> seq,K currentElement,T first)
{
yield return first;
while( pred(currentElement))
{
yield return seq.Current;
}
}
Using this function we can write:
csharp> numList.MyGroupBy().Select(x => x.Item1);
{ 1, 2, 1, 5 }
csharp> numList.MyGroupBy().Select(x => x.Item2.ToList());
{ { 1 }, { 2, 2, 2 }, { 1, 1, 1 }, { 5, 5 } }
csharp> strList.MyGroupBy(s => s.Length).Select(x => x.Item1);
{ 3, 2, 3, 2 }
csharp> strList.MyGroupBy(s => s.Length).Select(x => x.Item2.ToList());
{ { "abc", "ert", "bre" }, { "sd" }, { "ghj", "awe" }, { "ew", "gh" } }
One interesting fact about this way of writing the
groupby
function is that, you have to be very careful handling the resulting iterator/enumerable. From Python's groupby documentation:
The returned group is itself an iterator that shares the underlying iterable with groupby(). Because the source is shared, when the groupby() object is advanced, the previous group is no longer visible
For example in Python, the following effect occurs if we consume the complete iterator before consuming each group:
>>> l = groupby(strList,lambda s: len(s))
>>> consumed = list(groupby(strList,lambda s: len(s)))
>>> [list(g) for k,g in consumed]
[[], ['gh'], [], []]
In our C# version we have a similar restriction, for example:
csharp> var consumed = strList.MyGroupBy(s => s.Length).ToList();
csharp> consumed.Select(x => x.Item2);
{ { "abc" }, { "sd" }, { "ghj" }, { "ew" } }
Code for this post can be found
here.