Tuesday, August 14, 2007

First steps to generate sample XML files from XSD with Ruby

As my first non-"hello world" program in Ruby, I wanted to create something useful to me (or at least entertaining). A couple of weeks ago I needed to generate a sample file for a given XSD Schema. Eclipse already does something like this, but I thought it would be a fun programming exercise.

I wanted to create something that loads the XSD Schema into an object structure that can be queried to determine which elements will be generated. I didn't look for an existing library that does this, because building it myself is a much better exercise. However, creating something that supports the full XSD specification, like this or this, is a HUGE task, so I chose to support only a small subset of it.

For XML parsing and generation, I'm using REXML, which is a very nice library for XML manipulation.
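
To give an idea of what REXML offers for both tasks, here is a minimal sketch that parses a tiny schema and builds a new document from it. The inline schema and the element name "sample" are made up for this illustration and are not part of the experiment's code.

```ruby
require "rexml/document"

# A tiny schema used only for this illustration
xsd = <<~XSD
  <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
    <xs:element name="note" type="xs:string"/>
    <xs:element name="title" type="xs:string"/>
  </xs:schema>
XSD

# Parsing: read the declared element names from the schema
schema = REXML::Document.new(xsd)
names = schema.root.elements.collect { |e| e.attributes["name"] }

# Generation: build a new document with one child per declared element
sample = REXML::Document.new
root = sample.add_element("sample")
names.each { |n| root.add_element(n).text = "text" }

# Serialize into a string (any object that responds to << works)
output = String.new
sample.write(output)
puts output
```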

The basic strategy for loading the XSD Schema is to create a collection of classes, each handling one part of the supported schema features. For example, SchemaElement supports element declarations and SchemaComplexType supports complexType declarations.

Since an XSD Schema is an ordinary XML document, each element is loaded by a load_from method. For example, for SchemaElement the load_from method looks like this:

class SchemaElement
  def load_from(elementDefinition, prefixes)
    @name = elementDefinition.attributes["name"]
    if elementDefinition.attributes["type"]
      @element_type = Reference.new(elementDefinition.attributes["type"], prefixes)
    end
    if elementDefinition.attributes["substitutionGroup"]
      @substitution_group = Reference.new(elementDefinition.attributes["substitutionGroup"], prefixes)
    end

    # Visit the child nodes, skipping text nodes
    elementDefinition.find_all { |e| !e.is_a?(REXML::Text) }.each { |e|
      case e.name
      when "complexType"
        ct = SchemaComplexType.new
        @element_type = ct
      else
        print "Warning: ignoring #{e}"
      end
    }
  end
end


As shown in the load_from method, there are relationships between schema elements. For example, the type of an element could be a type defined elsewhere in this schema or in an imported schema. Once the schema is loaded, a second pass takes these references and replaces each one with a real reference to the target object. For SchemaElement, the solve_references method looks like this:

class SchemaElement
  def solve_references(collection)
    if @substitution_group.is_a?(XSDInfo::Reference)
      # Arguments assumed to mirror the @element_type lookup below
      @substitution_group = collection.get_type(@substitution_group.namespace, @substitution_group.name)
    end

    if @element_type.is_a?(XSDInfo::Reference)
      if (r = collection.get_type(@element_type.namespace, @element_type.name))
        @element_type = r
      else
        print "Not found #{@element_type.namespace}.#{@element_type.name}\n"
      end
    end
    # The @solving flag guards against endless recursion on cyclic type references
    if !@solving
      @solving = true
      @element_type.solve_references(collection) unless @element_type == nil
      @solving = false
    end
  end
end

Here collection points to a SchemaCollection object that holds all the loaded schemas.
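
The two-pass idea (load placeholders first, resolve them later) can be sketched in isolation. The classes below are simplified, hypothetical stand-ins for XSDInfo::Reference, SchemaCollection and SchemaElement, written only for this illustration. In particular, the way a qualified name such as "t:Address" is split using the prefix table is an assumption, not necessarily what the real Reference class does.

```ruby
# Hypothetical stand-ins for the real XSDInfo classes, for illustration only
class Reference
  attr_reader :namespace, :name

  # qname is e.g. "t:Address"; prefixes maps prefix => namespace URI
  def initialize(qname, prefixes)
    prefix, local = qname.include?(":") ? qname.split(":", 2) : [nil, qname]
    @namespace = prefixes[prefix]
    @name = local
  end
end

class TypeTable
  def initialize
    @types = {}
  end

  def add_type(namespace, name, type)
    @types[[namespace, name]] = type
  end

  # Returns nil when the type is unknown
  def get_type(namespace, name)
    @types[[namespace, name]]
  end
end

class Element
  attr_reader :element_type

  # First pass: only a placeholder is stored
  def initialize(type_qname, prefixes)
    @element_type = Reference.new(type_qname, prefixes)
  end

  # Second pass: swap the placeholder for the real object if it is known
  def solve_references(table)
    return unless @element_type.is_a?(Reference)
    resolved = table.get_type(@element_type.namespace, @element_type.name)
    @element_type = resolved if resolved
  end
end

prefixes = { "t" => "urn:example:types", nil => "urn:example:default" }
table = TypeTable.new
address_type = Object.new
table.add_type("urn:example:types", "Address", address_type)

known = Element.new("t:Address", prefixes)
unknown = Element.new("t:Missing", prefixes)
[known, unknown].each { |el| el.solve_references(table) }

puts known.element_type.equal?(address_type)
puts unknown.element_type.is_a?(Reference)
```

Deferring the lookup to a second pass means declarations can appear in any order, and types from imported schemas only need to be present by the time the references are solved.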

With all this in place, we can load an XSD Schema and start querying its parts. For example, we can get the list of attributes that apply to the b tag in the XHTML schema:

$ irb -r xsd/xsd.rb
irb(main):001:0> sc = XSDInfo::SchemaCollection.new
=> #<XSDInfo::SchemaCollection:0xb7b71170>
irb(main):002:0> sc.add_schema XSDInfo::SchemaInformation.new("../xhtml1-strict.xsd")
irb(main):003:0> sc.namespaces.each {|ns| sc[ns].solve_references sc}
=> ["http://www.w3.org/1999/xhtml"]
irb(main):004:0> sc["http://www.w3.org/1999/xhtml"].elements["b"].all_attributes.collect {|x| x.name}
=> ["onkeydown", "onkeypress", "onmouseover", "onkeyup", "onmousemove", "onmouseup", "ondblclick", "onmouseout", "onmousedown", "onclick", "title", "class", "id", "style", "dir", nil, "lang"]

Now, to generate the XML sample, we can create a generate_sample_content method for each part of the schema. For example, generate_sample_content for SchemaComplexType looks like this:

## Sample Generation

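def generate_sample_content(e, context)
  # Keep each named attribute with a probability of about 0.3
  atts = all_attributes.select { |x| x.name != nil && rand > 0.7 }
  atts.each { |att|
    sample_length = 1 + (10 * rand).to_i
    ltrs = ("a".."z").to_a
    # Assumed: pick one random lowercase letter per position and join them
    sample_text = (1..sample_length).collect { |p| ltrs[(rand * ltrs.length).to_i] }.join
    e.attributes[att.name] = sample_text
  }

  self.all_content_parts.each { |p| p.generate_sample_content(e, context) }
end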

The value of each attribute should be valid according to its simple type. However, this is not supported right now.
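
One possible way to add this later would be a table of value generators keyed by type name. The following is only a sketch of that idea: the constant, the method name and the handful of built-in types covered are all hypothetical, and none of this is part of the experiment's code.

```ruby
# Hypothetical generators keyed by XSD built-in type name (local part only)
SAMPLE_VALUE_GENERATORS = {
  "string"  => lambda { (1..(1 + rand(10))).collect { ("a".."z").to_a[rand(26)] }.join },
  "boolean" => lambda { rand < 0.5 ? "true" : "false" },
  "integer" => lambda { (rand(2001) - 1000).to_s },
  "date"    => lambda { format("%04d-%02d-%02d", 1970 + rand(60), 1 + rand(12), 1 + rand(28)) }
}

# Falls back to a random string for types this sketch does not know about
def sample_value_for(type_name)
  generator = SAMPLE_VALUE_GENERATORS[type_name] || SAMPLE_VALUE_GENERATORS["string"]
  generator.call
end

%w[string boolean integer date anyURI].each do |t|
  puts "#{t}: #{sample_value_for(t)}"
end
```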

As another example, the generate_sample_content method for the SchemaChoice class is the following:

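def generate_sample_content(e, context)
  if @minOccurs == 1 && @maxOccurs == 1
    # Exactly one occurrence: pick one alternative and always generate it
    element_to_gen = @elements[(rand * @elements.length).to_i]
    element_to_gen.generate_sample_content(e, context)
  elsif @minOccurs == 0 && @maxOccurs == 1
    # Optional: generate the chosen alternative about half of the time
    element_to_gen = @elements[(rand * @elements.length).to_i]
    element_to_gen.generate_sample_content(e, context) unless rand < 0.5
  elsif @maxOccurs == "unbounded"
    # Repeatable: make up to three attempts, each generating with probability 0.5
    (1..(rand * 4).to_i).each { |i|
      element_to_gen = @elements[(rand * @elements.length).to_i]
      element_to_gen.generate_sample_content(e, context) unless rand < 0.5
    }
  end
end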
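
The three cases above could also be folded into a single rule: pick a repetition count between minOccurs and maxOccurs, then choose one alternative per repetition. The sketch below is a variant for illustration, not the code used in this experiment. The helper names and the cap applied to "unbounded" are assumptions.

```ruby
# Hypothetical helper: choose a repetition count within [min_occurs, max_occurs].
# "unbounded" is capped at min_occurs + cap_extra so that the sample stays small.
def pick_occurrences(min_occurs, max_occurs, cap_extra = 3)
  upper = max_occurs == "unbounded" ? min_occurs + cap_extra : max_occurs
  min_occurs + rand(upper - min_occurs + 1)
end

# Pick one alternative at random for each occurrence of the choice
def sample_choice(alternatives, min_occurs, max_occurs)
  (1..pick_occurrences(min_occurs, max_occurs)).collect do
    alternatives[rand(alternatives.length)]
  end
end

puts sample_choice(%w[a b c], 1, 1).inspect
puts sample_choice(%w[a b c], 0, 1).inspect
puts sample_choice(%w[a b c], 2, "unbounded").inspect
```

Unlike the listing above, this variant never produces fewer than minOccurs items, at the cost of always stopping at an arbitrary upper bound for unbounded choices.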

Now with all this infrastructure we can generate some sample XML files:

def generate_sample_html_element name
  sc = XSDInfo::SchemaCollection.new
  sc.add_schema XSDInfo::SchemaInformation.new("../xhtml1-strict.xsd")
  sc.namespaces.each { |ns| sc[ns].solve_references sc }
  doc = REXML::Document.new
  f = File.new("output.xml", "w")
  doc.elements << sc[sc.namespaces[0]].elements[name].a_sample
  # Assumed: the document is written to the file, which is then closed
  doc.write(f)
  f.close
  return sc
end

We call:

irb(main):006:0> generate_sample_html_element "b"


<b class="zlxzzyunen" onkeydown="uaqz" onkeypress="kqyqmqn" onmouseover="sevcgov" onkeyup="ezglfa" lang="ckn" ondblclick="gfaskd" onmousedown="jwed" onclick="m">
<del ondblclick="xeepat"/>
<del cite="ymtye" title="wldaeawdi" onmouseover="fnk" id="sd" onmouseup="bfqxp" onkeyup="esyfhq">
<a tabindex="lcofhfti" href="ffuuebwn" title="jxhl" onkeydown="fsdwqt" rev="btbsuhl" onmouseup="zerecv" onkeyup="agwsyz" shape="htswqoew" onmousedown="ny" onclick="hq">
<object codetype="xbzmtvzd" onkeydown="ibsuthweoa" archive="ivav" onkeypress="sbhvtgvds" onmousemove="ll" onmousedown="kgbpgzj" onmouseout="nrpdnipw" classid="qwqzkzd" onclick="cybmhyab" usemap="aubjg"/>

The output is different on every run because we use the rand function in many parts of the process.
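
If a repeatable sample is ever needed (for example, to compare two runs), the random number generator can be seeded with Kernel#srand before generating. A minimal sketch, using a stand-alone random_word helper written for this illustration in the same style as the attribute value generation; the seed value 42 is arbitrary:

```ruby
# Random lowercase word of 1 to 10 letters
def random_word
  letters = ("a".."z").to_a
  (1..(1 + (10 * rand).to_i)).collect { letters[(rand * letters.length).to_i] }.join
end

srand(42)                 # fix the seed
first_run = (1..3).collect { random_word }

srand(42)                 # same seed, so rand yields the same sequence again
second_run = (1..3).collect { random_word }

puts first_run.inspect
puts first_run == second_run
```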

Code for this experiment can be found here.