Monday, October 25, 2010

Writing Python's groupby in C#

A couple of days ago, while working on a C# program, I had the necessity of grouping contiguous elements from a sequence given a property. A group needs to be created each time the value of the property changes in a similar way to the uniq Unix utility.

Python has a function called groupby which is part of the nice itertools module which does exactly what I want. For example:


>>> strList = ["abc","ert","bre","sd","ghj","awe","ew","gh"]
>>> [list(g) for i,g in groupby(strList,lambda x: len(x))]
[['abc', 'ert', 'bre'], ['sd'], ['ghj', 'awe'], ['ew', 'gh']]
>>> numList = [1,2,2,2,1,1,1,5,5]
>>> [list(g) for i,g in groupby(numList)]
[[1], [2, 2, 2], [1, 1, 1], [5, 5]]



In .NET a class called Enumerable with lots of extension methods to manipulate with sequences (IEnumerable<T>). This class includes an extension method called GroupBy which groups values according to a key. However it behaves more like SQL's Group by in that it considers all the values of the collection. For example:

csharp> using System.Collections.Generic;
csharp> var strList = new List<string>() {"abc","ert","bre","sd","ghj","awe","ew","gh"};

csharp> strList.GroupBy(x => x.Length);
{ { "abc", "ert", "bre", "ghj", "awe" }, { "sd", "ew", "gh" } }

csharp> var numList = new List<int>() {1,2,2,2,1,1,1,5,5};

csharp> numList.GroupBy(x => x);
{ { 1, 1, 1, 1 }, { 2, 2, 2 }, { 5, 5 } }


(The C# examples will be presented using the very useful Mono C# REPL)

Writing the equivalent function in C# turned out to be a nice programming excessive. Also trying to write the function using only "yield return" turned out the more challenging than I thought!.

As a first try we can write this function using intermediate collections to store the partial groups:

using System.Collections.Generic;
using System;
namespace Langexplr.Experiments
{
public class MyTuple<T,K>
{
public T Item1 { get; set; }
public K Item2 { get; set; }


public static MyTuple<T1,K1> Create<T1,K1>(T1 first, K1 second)
{
return new MyTuple<T1,K1>() {Item1 = first, Item2 = second};
}
}
public static class GroupByTests
{
public static IEnumerable<MyTuple<K,IList<T>>> MyGroupByWithLists<T,K>(this IEnumerable<T> en,Func<T,K> keyExtraction)
{
K currentKey = default(K);
bool firstTime = true;
IList<T> currentGroup = new List<T>();
foreach(var aValue in en)
{
if (firstTime)
{
currentKey = keyExtraction(aValue);
firstTime = false;
}
else
{
K tmpKey = keyExtraction(aValue);
if (!tmpKey.Equals(currentKey))
{
yield return MyTuple<K,IList<T>>.Create(currentKey, currentGroup);
currentGroup = new List<T>();
currentKey = tmpKey;
}

}
currentGroup.Add(aValue);
}
if (currentGroup.Count > 0)
{
yield return MyTuple<K,IList<T>>.Create(currentKey, currentGroup);
}
}

}
}

Here I'm defining a class called MyTuple which is very similar to .NET 4' Tuple class to store group's key and members.

This function works as expected, for example:

csharp> strList.MyGroupByWithLists(x => x.Length).Select(x => x.Item2).ToList();
{ { "abc", "ert", "bre" }, { "sd" }, { "ghj", "awe" }, { "ew", "gh" } }
csharp> numList.MyGroupByWithLists(x => x).Select(x => x.Item2).ToList();
{ { 1 }, { 2, 2, 2 }, { 1, 1, 1 }, { 5, 5 } }



One of the interesting things about the Python version of groupby is that it doesn't create an intermediate collections for each group. The itertools module reference has the code for the groupby implementation.

Trying to write a this function with similar characteristics in C# resulted in the following (scary) code:


public static IEnumerable<MyTuple<K,IEnumerable<T>>> MyGroupBy<T,K>(this IEnumerable<T> en,Func<T,K> keyExtraction)
{
K currentGroupKey = default(K);
bool firstTime = true;
bool hasMoreElements = false;
bool yieldNewValue = false;

IEnumerator<T> enumerator = en.GetEnumerator();
hasMoreElements = enumerator.MoveNext();
while (hasMoreElements)
{
if (firstTime)
{
firstTime = false;
yieldNewValue = true;
currentGroupKey = keyExtraction(enumerator.Current);
}
else
{
K lastKey;
while((lastKey = keyExtraction(enumerator.Current)).Equals( currentGroupKey) &&
(hasMoreElements = enumerator.MoveNext()))
{

}
if(hasMoreElements &&
!lastKey.Equals(currentGroupKey))
{
currentGroupKey = lastKey;
yieldNewValue = true;
}
else
{
yieldNewValue = false;
}
}

if (yieldNewValue) {
yield return MyTuple<K,IEnumerable<T>>.Create(
currentGroupKey,
ReturnSubSequence((x) => {
hasMoreElements = enumerator.MoveNext();
return hasMoreElements &&
x.Equals(keyExtraction(enumerator.Current)); },
enumerator,
currentGroupKey,
enumerator.Current));
}
}
}
static IEnumerable<T> ReturnSubSequence<T,K>(Predicate<K> pred, IEnumerator<T> seq,K currentElement,T first)
{
yield return first;
while( pred(currentElement))
{
yield return seq.Current;
}
}



Using this function we can write:


csharp> numList.MyGroupBy().Select(x => x.Item1);
{ 1, 2, 1, 5 }
csharp> numList.MyGroupBy().Select(x => x.Item2.ToList());
{ { 1 }, { 2, 2, 2 }, { 1, 1, 1 }, { 5, 5 } }
csharp> strList.MyGroupBy(s => s.Length).Select(x => x.Item1);
{ 3, 2, 3, 2 }
csharp> strList.MyGroupBy(s => s.Length).Select(x => x.Item2.ToList());
{ { "abc", "ert", "bre" }, { "sd" }, { "ghj", "awe" }, { "ew", "gh" } }



One interesting fact about this way of writing the groupby function is that, you have to be very careful handling the resulting iterator/enumerable. From Python's groupby documentation:

The returned group is itself an iterator that shares the underlying iterable with groupby(). Because the source is shared, when the groupby() object is advanced, the previous group is no longer visible


For example in Python, the following effect occurs if we consume the complete iterator before consuming each group:

>>> l = groupby(strList,lambda s: len(s))
>>> consumed = list(groupby(strList,lambda s: len(s)))
>>> [list(g) for k,g in consumed]
[[], ['gh'], [], []]


In our C# version we have a similar restriction, for example:


csharp> var consumed = strList.MyGroupBy(s => s.Length).ToList();
csharp> consumed.Select(x => x.Item2);
{ { "abc" }, { "sd" }, { "ghj" }, { "ew" } }


Code for this post can be found here.

Wednesday, October 13, 2010

Using C#'s implicit type conversions from other .NET languages

One interesting C# feature is the ability to define a method that implements implicit conversion from one type to another. In this post I'm going to show how to use this feature from IronPython, F#, VB.NET and IronRuby.

Example



In order to illustrate the implicit conversion feature we're going to use the following classes:


namespace Langexplr.Experiments
{
public class Complex
{
public double Real { get; set; }
public double Img { get; set; }

public static implicit operator Complex(double real)
{
return new Complex() { Real = real };
}

public static implicit operator Polar(Complex complex)
{
return new Polar() { Angle = Math.Atan(complex.Img/complex.Real),
Length = Math.Sqrt(complex.Real*complex.Real +
complex.Img*complex.Img) };
}

public static implicit operator double(Complex complex)
{
return Math.Sqrt(complex.Real*complex.Real +
complex.Img*complex.Img);
}

}

public class Polar
{
public double Angle { get; set; }
public double Length { get; set; }
}
}



The Complex class is a simple definition of a complex number. The Polar class is defined(conveniently) to represent a complex number in polar form. The Complex class defines three implicit conversions:

  1. From double to a complex number

  2. From Complex to Polar

  3. From Complex to double


The following C# code shows a use of this feature:


using Langexplr.Experiments;
using System;

class main
{
public static void Main(string[] args)
{
Complex c = 10.3;
Polar p = new Complex() {Real = 12.3, Img = 5.2};
double abs = c;

Console.WriteLine("abs:{0} Polar: {1},{2}", abs ,p.Angle ,p.Length);
}
}


By looking at the definitions generated by the compiler for the Complex class, we can see several definitions for the op_Implicit method with different parameters and return types.

...
.method public hidebysig specialname static
class Langexplr.Experiments.Complex
op_Implicit(float64 real) cil managed
...
.method public hidebysig specialname static
class Langexplr.Experiments.Polar
op_Implicit(class Langexplr.Experiments.Complex complex) cil managed
...
.method public hidebysig specialname static
float64 op_Implicit(class Langexplr.Experiments.Complex complex) cil managed
...



Now these uses of the Complex class will be presented on different .NET languages.

IronPython



As described in "Dark Corners of IronPython" by Michael Foord the clr.Convert function can be used to convert between types using the op_Implicit if necessary.

For example:


import clr
clr.AddReference("ImplicitTest")

from Langexplr.Experiments import *
from System import Double

c = clr.Convert(10.3, Complex)
nC = Complex()
nC.Real = 12.3
nC.Img = 5.2
p = clr.Convert(nC, Polar)
abs = clr.Convert(c, Double)

print 'abs: %(0)f Polar: %(1)f,%(2)f\n' % \
{ '0': abs, '1' : p.Angle, '2' : p.Length }


IronRuby



IronRuby will use the op_Implicit definition if a conversion required at a particular call. I couldn't find a nice way to do this directly as with IronPython's clr.Convert . However the following function definition seems to do the trick:


def dotnet_convert(value,type)
f = System::Func[type,type].new {|x| x}
f.invoke(value)
end


This conversion function works since IronRuby tries to convert the value to the expected .NET type in the call to 'invoke' .

Using this definition we can write:


require 'ImplicitTest.dll'

c = dotnet_convert(10.3,Langexplr::Experiments::Complex)
nC = Langexplr::Experiments::Complex.new
nC.Real = 12.3
nC.Img = 5.2
p = dotnet_convert(nC,Langexplr::Experiments::Polar)
abs = dotnet_convert(c,System::Double)

print "abs: #{abs} Polar: #{p.Angle},#{p.Length} \n"



F#



In F# we can call the op_Implicit method directly and F# will use type inference to determine the correct overload to use.

For example:

open Langexplr.Experiments

let c : Complex = Complex.op_Implicit 10.3
let p : Polar = Complex.op_Implicit (new Complex(Real=12.3, Img=5.2))
let abs : double = Complex.op_Implicit c

System.Console.WriteLine("1. {0} Polar: {1},{2} ", abs, p.Angle, p.Length )



There's a nice post called "F# – Duck Typing and Structural Typing" by Matthew Podwysocki, which describes a nice way to define a generic function to use the op_Implicit operator.


let inline convert (x:^a) : ^b = ((^a or ^b) : (static member op_Implicit : ^a -> ^b) x )


This function can be used as follows:


let c2:Complex = convert 10.3
let p2:Polar = convert (new Complex(Real=12.3, Img=5.2))
let abs2:float = convert c

System.Console.WriteLine("2. {0} Polar: {1},{2} ",abs2,p2.Angle,p2.Length)


VB.NET



Finally in Visual Basic .NET the implicit conversion is used automatically as in C#. For example:


Imports System
Imports Langexplr.Experiments
Module Test
Sub Main
Dim c As Complex = 10.3
Dim p As Polar = new Complex() With { _
.Real = 12.3, _
.Img = 5.2 _
}
Dim abs As Double = c
Console.WriteLine("abs:{0} Polar: {1},{2}",abs,p.Angle,p.Length)
End Sub
End Module