There are several ways to parse an XML document such as the following:
<?xml version="1.0" encoding="UTF-8"?>
<User xmlns="xmlns://www.myschema.org"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:type="User">
<Addresses>
<Address Type="Home">
<City>MyCity</City>
</Address>
<Address Type="Work">
<City>MyWorkCity</City>
</Address>
</Addresses>
</User>
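One approach queries the document with LINQ to XML:
using System.Linq;
using System.Xml.Linq;
// The document declares a default namespace, so element names must be qualified with it.
XNamespace ns = "xmlns://www.myschema.org";
XElement root = XElement.Load("User.xml");
// Query syntax
var citiesUsingStatement =
from el in root.Element(ns + "Addresses")?.Elements(ns + "Address")
where (string)el.Attribute("Type") == "Home"
select el.Element(ns + "City").Value;
// or, equivalently, method syntax
var citiesUsingLinq = root.Element(ns + "Addresses")?.Elements(ns + "Address")
.Where(el => (string)el.Attribute("Type") == "Home")
.Select(el => el.Element(ns + "City").Value);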
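This approach works, but it fails ungracefully when the XML is not populated as expected: a missing Addresses or City element surfaces as an exception deep inside the query (for example, a NullReferenceException from .Value) rather than as a clear error about the document.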
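One way to soften that failure mode is to make the query itself tolerant of missing pieces. The following is a minimal sketch, not a definitive implementation: the inline document is a hypothetical stand-in for User.xml, and it relies on the fact that casting a null XElement or XAttribute to string yields null instead of throwing.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

// Hypothetical inline document: the second Address has no City,
// and the third has no Type attribute.
const string xml = @"<User xmlns=""xmlns://www.myschema.org"">
  <Addresses>
    <Address Type=""Home""><City>MyCity</City></Address>
    <Address Type=""Home"" />
    <Address><City>NoTypeCity</City></Address>
  </Addresses>
</User>";

XNamespace ns = "xmlns://www.myschema.org";
XElement root = XElement.Parse(xml);

var homeCities =
    (root.Element(ns + "Addresses")?.Elements(ns + "Address")
        ?? Enumerable.Empty<XElement>())              // no Addresses element: empty result
    .Where(el => (string)el.Attribute("Type") == "Home")
    .Select(el => (string)el.Element(ns + "City"))    // null when City is missing
    .Where(city => city != null)
    .ToList();

Console.WriteLine(string.Join(", ", homeCities));
```

The query no longer throws on incomplete data, but it also silently skips it, which may or may not be what the caller wants.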
Another approach deserializes the document into typed objects, and it fails far more gracefully when it does fail:
using System.Linq;
using System.Xml;
using System.Xml.Serialization;
User user;
using (var userXML = XmlReader.Create("User.xml"))
{
user = (User)(new XmlSerializer(typeof(User))).Deserialize(userXML);
}
var citiesUsingLinq = user.Addresses.Where(a => a.Type == "Home").Select(a => a.City);
As shown, the code using serialization is more concise and gets compile-time checking of member names and types. It also rejects a document whose root element does not match the expected type with an exception instead of quietly producing meaningless results, which matters when the XML comes from an untrusted source. But how do we get there from an XML document?
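As a rough illustration of what failing gracefully can look like, the sketch below wraps deserialization in a helper that catches the InvalidOperationException that XmlSerializer throws for malformed or unexpected documents. The helper name TryLoadUser, the inline documents, and the hand-written User and Address classes are assumptions made for this example; the sample deliberately omits the xsi:type attribute.

```csharp
using System;
using System.IO;
using System.Linq;
using System.Xml.Serialization;

const string good = @"<User xmlns=""xmlns://www.myschema.org"">
  <Addresses>
    <Address Type=""Home""><City>MyCity</City></Address>
  </Addresses>
</User>";

// Hypothetical helper: returns null instead of throwing when the
// document cannot be deserialized into a User.
User TryLoadUser(string xml)
{
    var serializer = new XmlSerializer(typeof(User));
    try
    {
        using (var reader = new StringReader(xml))
        {
            return (User)serializer.Deserialize(reader);
        }
    }
    catch (InvalidOperationException ex)
    {
        // The inner exception, if any, usually carries the detail.
        Console.Error.WriteLine($"Rejected: {ex.Message} {ex.InnerException?.Message}");
        return null;
    }
}

User user = TryLoadUser(good);
Console.WriteLine(string.Join(", ",
    user.Addresses.Where(a => a.Type == "Home").Select(a => a.City)));

// A document with the wrong root element is rejected, not misread.
Console.WriteLine(TryLoadUser("<NotUser />") == null);

// Simplified, hand-written stand-ins for the serializable classes.
[XmlRoot(Namespace = "xmlns://www.myschema.org")]
[XmlType(AnonymousType = true, Namespace = "xmlns://www.myschema.org")]
public class User
{
    [XmlArrayItem("Address")]
    public Address[] Addresses { get; set; }
}

[XmlType(AnonymousType = true, Namespace = "xmlns://www.myschema.org")]
public class Address
{
    public string City { get; set; }

    [XmlAttribute]
    public string Type { get; set; }
}
```

Returning null keeps the sketch short; real code might prefer to rethrow a more specific exception or return a result object that carries the error message.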
We begin by using XmlSchemaInference to infer a schema from the XML document and write it to a file.
using System.IO;
using System.Linq;
using System.Xml;
using System.Xml.Schema;
XmlSchemaInference inference = new XmlSchemaInference();
XmlSchemaSet schemaSet;
using (XmlReader reader = XmlReader.Create("User.xml"))
{
schemaSet = inference.InferSchema(reader);
}
// Schemas() returns a non-generic ICollection, so cast the items before using LINQ.
XmlSchema schema = schemaSet.Schemas().Cast<XmlSchema>().FirstOrDefault();
// FileMode.CreateNew throws if User.xsd already exists.
using (var stream = new FileStream("User.xsd", FileMode.CreateNew))
{
schema?.Write(stream);
}
This produces the following User.xsd schema:
<?xml version="1.0"?>
<xs:schema xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
attributeFormDefault="unqualified"
elementFormDefault="qualified"
targetNamespace="xmlns://www.myschema.org"
xmlns:xs="http://www.w3.org/2001/XMLSchema">
<xs:element name="User">
<xs:complexType>
<xs:sequence>
<xs:element name="Addresses">
<xs:complexType>
<xs:sequence>
<xs:element maxOccurs="unbounded" name="Address">
<xs:complexType>
<xs:sequence>
<xs:element name="City" type="xs:string" />
</xs:sequence>
<xs:attribute name="Type" type="xs:string" use="required" />
</xs:complexType>
</xs:element>
</xs:sequence>
</xs:complexType>
</xs:element>
</xs:sequence>
</xs:complexType>
</xs:element>
</xs:schema>
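The inferred schema can also be used directly to check whether other documents have the same shape. The following is a sketch under stated assumptions: the documents are inline and hypothetical, the xsi:type attribute is left out of the sample, and Validate is a helper name invented for this example. It infers a schema from one document and then validates documents against it through XmlReaderSettings.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;
using System.Xml.Schema;

// Hypothetical sample document used to infer the schema.
const string sample = @"<User xmlns=""xmlns://www.myschema.org"">
  <Addresses>
    <Address Type=""Home""><City>MyCity</City></Address>
    <Address Type=""Work""><City>MyWorkCity</City></Address>
  </Addresses>
</User>";

// Same shape, but the Address lacks the Type attribute.
const string broken = @"<User xmlns=""xmlns://www.myschema.org"">
  <Addresses>
    <Address><City>MyCity</City></Address>
  </Addresses>
</User>";

XmlSchemaSet schemaSet;
using (XmlReader reader = XmlReader.Create(new StringReader(sample)))
{
    schemaSet = new XmlSchemaInference().InferSchema(reader);
}

// Hypothetical helper: collects the validation messages for a document.
List<string> Validate(string xml)
{
    var problems = new List<string>();
    var settings = new XmlReaderSettings { ValidationType = ValidationType.Schema };
    settings.Schemas.Add(schemaSet);
    settings.ValidationEventHandler += (sender, e) => problems.Add(e.Message);
    using (XmlReader validatingReader = XmlReader.Create(new StringReader(xml), settings))
    {
        while (validatingReader.Read()) { } // reading the document triggers validation
    }
    return problems;
}

Console.WriteLine($"sample problems: {Validate(sample).Count}");
foreach (string problem in Validate(broken))
{
    Console.WriteLine($"broken: {problem}");
}
```

Because the schema is inferred from a single sample, it is only as strict as that sample: anything the sample happens to contain everywhere, such as the Type attribute, is treated as required.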
Now the xsd tool can be run on the schema to generate classes. It writes User.cs itself, so its output does not need to be redirected.
xsd User.xsd /classes /nologo
The generated User.cs should look similar to the following (extra attributes removed):
using System.Xml.Serialization;
[System.SerializableAttribute()]
[System.Xml.Serialization.XmlTypeAttribute(AnonymousType=true, Namespace="xmlns://www.myschema.org")]
[System.Xml.Serialization.XmlRootAttribute(Namespace="xmlns://www.myschema.org", IsNullable=false)]
public partial class User {
[System.Xml.Serialization.XmlArrayItemAttribute("Address", IsNullable=false)]
public Address[] Addresses {get; set;}
}
[System.SerializableAttribute()]
[System.Xml.Serialization.XmlTypeAttribute(AnonymousType=true, Namespace="xmlns://www.myschema.org")]
public partial class Address {
public string City {get; set;}
[System.Xml.Serialization.XmlAttributeAttribute()]
public string Type {get; set;}
}
Now it is time to deserialize!
using System.Xml;
using System.Xml.Serialization;
User user;
using (var userXML = XmlReader.Create("User.xml"))
{
user = (User)(new XmlSerializer(typeof(User))).Deserialize(userXML);
}
Oh no! Deserialize throws an exception. What went wrong?
It turns out that the xsi:type="User" attribute on the root element is causing the headache. Since there are probably more XML files in this format, we won't edit the XML. Instead, we fix it in code as follows:
public partial class User : User_base {}
[System.Xml.Serialization.XmlRootAttribute(Namespace="xmlns://www.myschema.org", IsNullable = false, ElementName = "User")]
[System.Xml.Serialization.XmlTypeAttribute(Namespace="xmlns://www.myschema.org")]
[XmlIncludeAttribute(typeof(User))]
public abstract class User_base{}
Note that an abstract class cannot be instantiated, so the serializer looks for an included type to deserialize into instead. Since there is only one included type, User, a User object is produced, and the downcast from User_base to User succeeds.
Now the following code should deserialize and produce the correct results:
using System.Linq;
using System.Xml;
using System.Xml.Serialization;
User user;
using (var userXML = XmlReader.Create("User.xml"))
{
user = (User)(new XmlSerializer(typeof(User_base))).Deserialize(userXML);
}
var citiesUsingLinq = user.Addresses.Where(a => a.Type == "Home").Select(a => a.City);