Apparently XML isn’t dead yet: today I received a Google Product Feed in the RSS 2.0 XML format. The feed was full of duplicates, and my job was to remove them:
    <?xml version="1.0" encoding="utf-8"?>
    <rss version="2.0" xmlns:g="http://base.google.com/ns/1.0">
      <channel>
        <item>
          <g:id>100</g:id>
          <title>Product 100</title>
          ...
          ...
        </item>
        <item>
          <g:id>100</g:id>
          <title>Product 100</title>
          ...
          ...
        </item>
        <item>
          <g:id>200</g:id>
          <title>Product 200</title>
          ...
          ...
        </item>
        <item>
          <g:id>300</g:id>
          <title>Product 300</title>
          ...
          ...
        </item>
      </channel>
    </rss>
As you can see, “Product 100” appears twice.
THE SOLUTION:
A little LINQ can get you far:
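    using System.Xml;
    using System.Xml.Linq;
    using System.Linq;

    var document = XDocument.Parse(theXMLString);
    XNamespace g = "http://base.google.com/ns/1.0";

    // Note: no semicolon after Where(), otherwise the chain breaks and the code does not compile
    document.Descendants().Where(node => node.Name == "item")
        .GroupBy(node => node.Element(g + "id").Value) // group items by their <g:id>
        .SelectMany(node => node.Skip(1))              // keep the first of each group
        .Remove();                                     // remove the rest from the document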
HOW IT WORKS:
- document.Descendants().Where(node => node.Name == "item"): Get all elements named "item".
- GroupBy(node => node.Element(g + "id").Value): Group them by the value of their "g:id" child element.
- SelectMany(node => node.Skip(1)): From each group, skip the first element and select the rest, i.e. the duplicates.
- Remove(): Delete all selected elements from the document.
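The steps above can be wrapped in a small, self-contained program. This is only a sketch: the class name FeedDeduplicator and the method RemoveDuplicateItems are made up for illustration, and skipping items that have no "g:id" element is my own assumption (the original one-liner would throw a NullReferenceException on such an item).

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

// Hypothetical helper class, not part of any library.
public static class FeedDeduplicator
{
    // Removes every <item> whose <g:id> value was already seen earlier
    // in the document and returns the number of removed items.
    public static int RemoveDuplicateItems(XDocument document)
    {
        XNamespace g = "http://base.google.com/ns/1.0";

        var duplicates = document
            .Descendants("item")
            // Assumption: items without a <g:id> are left untouched.
            .Where(item => item.Element(g + "id") != null)
            .GroupBy(item => item.Element(g + "id").Value)
            .SelectMany(group => group.Skip(1))
            .ToList(); // materialize the query before changing the tree

        foreach (var duplicate in duplicates)
        {
            duplicate.Remove();
        }

        return duplicates.Count;
    }
}

public static class Program
{
    public static void Main()
    {
        const string xml = @"<rss version=""2.0"" xmlns:g=""http://base.google.com/ns/1.0"">
  <channel>
    <item><g:id>100</g:id><title>Product 100</title></item>
    <item><g:id>100</g:id><title>Product 100</title></item>
    <item><g:id>200</g:id><title>Product 200</title></item>
    <item><g:id>300</g:id><title>Product 300</title></item>
  </channel>
</rss>";

        var document = XDocument.Parse(xml);
        int removed = FeedDeduplicator.RemoveDuplicateItems(document);
        int kept = document.Descendants("item").Count();

        // prints "Removed 1, kept 3"
        Console.WriteLine("Removed " + removed + ", kept " + kept);
    }
}
```

Because GroupBy preserves the order in which keys and elements first appear, the item that survives is always the first occurrence in the feed.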
MORE TO READ:
- XDocument Class from Microsoft
- Remove duplicate from xdocument based on element value from Stack Overflow
- How to remove duplicate data from xml using C# from CodeProject
- Removing Duplicate Element Entries from the XML file using XLinq from forums.asp.net
- Google Product Feed specification from Google
- RSS 2.0 Specification from W3.org