The problem with the XML is the unescaped '&' characters in the source tags. A simple find and replace of '&' with '&amp;' should do the trick, which is what I did to parse the XML file.
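For anyone who would rather script the fix than do it by hand, here is a minimal sketch in Python. It is not exactly what I did: a blind replace would also turn any already-escaped '&amp;' into '&amp;amp;', so this version uses a regular expression that only touches bare ampersands. The element names and the sample string are made up for illustration and are not taken from the actual file.

```python
import re
import xml.etree.ElementTree as ET

# Matches an '&' that does NOT already begin an entity or character
# reference such as &amp; or &#38; or &#x26;
BARE_AMP = re.compile(r"&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#x[0-9A-Fa-f]+);)")


def escape_bare_ampersands(text: str) -> str:
    """Replace each bare '&' with '&amp;' so the XML becomes well-formed."""
    return BARE_AMP.sub("&amp;", text)


# Hypothetical sample: an unescaped '&' inside a source tag.
raw = "<spell><source>A & B</source></spell>"

fixed = escape_bare_ampersands(raw)
root = ET.fromstring(fixed)  # raises ParseError if run on `raw` instead

print(root.find("source").text)
```

The parser turns '&amp;' back into '&' when it reads the text, so the values you get out are the ones you would expect.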
With regard to normalisation, my problem was where to stop. I eventually settled on class, domain, level, and school mainly because...