Python has a function called groupby which is part of the nice itertools module which does exactly what I want. For example:
>>> strList = ["abc","ert","bre","sd","ghj","awe","ew","gh"]
>>> [list(g) for i,g in groupby(strList,lambda x: len(x))]
[['abc', 'ert', 'bre'], ['sd'], ['ghj', 'awe'], ['ew', 'gh']]
>>> numList = [1,2,2,2,1,1,1,5,5]
>>> [list(g) for i,g in groupby(numList)]
[[1], [2, 2, 2], [1, 1, 1], [5, 5]]
In .NET a class called Enumerable with lots of extension methods to manipulate with sequences (IEnumerable<T>). This class includes an extension method called GroupBy which groups values according to a key. However it behaves more like SQL's Group by in that it considers all the values of the collection. For example:
csharp> using System.Collections.Generic;
csharp> var strList = new List<string>() {"abc","ert","bre","sd","ghj","awe","ew","gh"};
csharp> strList.GroupBy(x => x.Length);
{ { "abc", "ert", "bre", "ghj", "awe" }, { "sd", "ew", "gh" } }
csharp> var numList = new List<int>() {1,2,2,2,1,1,1,5,5};
csharp> numList.GroupBy(x => x);
{ { 1, 1, 1, 1 }, { 2, 2, 2 }, { 5, 5 } }
(The C# examples will be presented using the very useful Mono C# REPL)
Writing the equivalent function in C# turned out to be a nice programming excessive. Also trying to write the function using only "yield return" turned out the more challenging than I thought!.
As a first try we can write this function using intermediate collections to store the partial groups:
using System.Collections.Generic;
using System;
namespace Langexplr.Experiments
{
public class MyTuple<T,K>
{
public T Item1 { get; set; }
public K Item2 { get; set; }
public static MyTuple<T1,K1> Create<T1,K1>(T1 first, K1 second)
{
return new MyTuple<T1,K1>() {Item1 = first, Item2 = second};
}
}
public static class GroupByTests
{
public static IEnumerable<MyTuple<K,IList<T>>> MyGroupByWithLists<T,K>(this IEnumerable<T> en,Func<T,K> keyExtraction)
{
K currentKey = default(K);
bool firstTime = true;
IList<T> currentGroup = new List<T>();
foreach(var aValue in en)
{
if (firstTime)
{
currentKey = keyExtraction(aValue);
firstTime = false;
}
else
{
K tmpKey = keyExtraction(aValue);
if (!tmpKey.Equals(currentKey))
{
yield return MyTuple<K,IList<T>>.Create(currentKey, currentGroup);
currentGroup = new List<T>();
currentKey = tmpKey;
}
}
currentGroup.Add(aValue);
}
if (currentGroup.Count > 0)
{
yield return MyTuple<K,IList<T>>.Create(currentKey, currentGroup);
}
}
}
}
Here I'm defining a class called
MyTuple
which is very similar to .NET 4' Tuple class to store group's key and members.This function works as expected, for example:
csharp> strList.MyGroupByWithLists(x => x.Length).Select(x => x.Item2).ToList();
{ { "abc", "ert", "bre" }, { "sd" }, { "ghj", "awe" }, { "ew", "gh" } }
csharp> numList.MyGroupByWithLists(x => x).Select(x => x.Item2).ToList();
{ { 1 }, { 2, 2, 2 }, { 1, 1, 1 }, { 5, 5 } }
One of the interesting things about the Python version of
groupby
is that it doesn't create an intermediate collections for each group. The itertools module reference has the code for the groupby implementation.Trying to write a this function with similar characteristics in C# resulted in the following (scary) code:
public static IEnumerable<MyTuple<K,IEnumerable<T>>> MyGroupBy<T,K>(this IEnumerable<T> en,Func<T,K> keyExtraction)
{
K currentGroupKey = default(K);
bool firstTime = true;
bool hasMoreElements = false;
bool yieldNewValue = false;
IEnumerator<T> enumerator = en.GetEnumerator();
hasMoreElements = enumerator.MoveNext();
while (hasMoreElements)
{
if (firstTime)
{
firstTime = false;
yieldNewValue = true;
currentGroupKey = keyExtraction(enumerator.Current);
}
else
{
K lastKey;
while((lastKey = keyExtraction(enumerator.Current)).Equals( currentGroupKey) &&
(hasMoreElements = enumerator.MoveNext()))
{
}
if(hasMoreElements &&
!lastKey.Equals(currentGroupKey))
{
currentGroupKey = lastKey;
yieldNewValue = true;
}
else
{
yieldNewValue = false;
}
}
if (yieldNewValue) {
yield return MyTuple<K,IEnumerable<T>>.Create(
currentGroupKey,
ReturnSubSequence((x) => {
hasMoreElements = enumerator.MoveNext();
return hasMoreElements &&
x.Equals(keyExtraction(enumerator.Current)); },
enumerator,
currentGroupKey,
enumerator.Current));
}
}
}
static IEnumerable<T> ReturnSubSequence<T,K>(Predicate<K> pred, IEnumerator<T> seq,K currentElement,T first)
{
yield return first;
while( pred(currentElement))
{
yield return seq.Current;
}
}
Using this function we can write:
csharp> numList.MyGroupBy().Select(x => x.Item1);
{ 1, 2, 1, 5 }
csharp> numList.MyGroupBy().Select(x => x.Item2.ToList());
{ { 1 }, { 2, 2, 2 }, { 1, 1, 1 }, { 5, 5 } }
csharp> strList.MyGroupBy(s => s.Length).Select(x => x.Item1);
{ 3, 2, 3, 2 }
csharp> strList.MyGroupBy(s => s.Length).Select(x => x.Item2.ToList());
{ { "abc", "ert", "bre" }, { "sd" }, { "ghj", "awe" }, { "ew", "gh" } }
One interesting fact about this way of writing the
groupby
function is that, you have to be very careful handling the resulting iterator/enumerable. From Python's groupby documentation:The returned group is itself an iterator that shares the underlying iterable with groupby(). Because the source is shared, when the groupby() object is advanced, the previous group is no longer visible
For example in Python, the following effect occurs if we consume the complete iterator before consuming each group:
>>> l = groupby(strList,lambda s: len(s))
>>> consumed = list(groupby(strList,lambda s: len(s)))
>>> [list(g) for k,g in consumed]
[[], ['gh'], [], []]
In our C# version we have a similar restriction, for example:
csharp> var consumed = strList.MyGroupBy(s => s.Length).ToList();
csharp> consumed.Select(x => x.Item2);
{ { "abc" }, { "sd" }, { "ghj" }, { "ew" } }
Code for this post can be found here.