How to get the first occurrence of src with HTML Agility Pack


Because the XML files I have are not well-formed, I’m using HTML Agility Pack.
For example, I am parsing this feed:

I have an array of these elements (so the "src" values are always unique):

<content:encoded><![CDATA[<h2><a href=""><img class="alignnone size-full wp-image-23086" src="" alt="" width="1200" height="409" srcset=" 200w, 300w, 400w,

I want only the first image URL from the src attribute of each element, so my expected output is an array of URLs:


I can output the whole img element from the 'content:encoded' node with:

var images = doc.DocumentNode.SelectNodes(".//*[name()='content:encoded']/img").ToArray();
foreach (var item in images)
    Console.WriteLine("image: " + item.OuterHtml);

Properties other than OuterHtml give me blank output.

I can also output every img from this string with:

var items = doc.DocumentNode.SelectNodes("//img[@src]").ToArray();
foreach (var image in items)
    Console.WriteLine("img: " + image.Attributes["src"].Value);

I know I have to extract the first occurrence of "https" from the img element.
I’ve tried many XPath expressions, but I can’t get it to work. My XPath is probably wrong, but I don’t know how to fix it.

Any help would be much appreciated :), thanks!


I think I got it. With a regex I just do:

// requires: using System.Text.RegularExpressions;
var items = doc.DocumentNode.SelectNodes(".//item").ToArray();
foreach (var item in items)
{
    // Lazy match: capture the src value of the first <img> in this item.
    string matchString = Regex.Match(item.OuterHtml, "<img.+?src=[\"'](.+?)[\"'].*?>", RegexOptions.IgnoreCase).Groups[1].Value;
    Console.WriteLine("img: " + matchString);
}
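
A minimal sketch of the same idea as a reusable helper, assuming only System.Text.RegularExpressions. The class name ImgSrc and the method FirstSrc are hypothetical, and the pattern is slightly stricter than the one above: it requires whitespace before src (so data-src and srcset are skipped) and a closing quote that matches the opening one. It is still a regex over markup, so it may misfire on unusual HTML such as a > inside an attribute value.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of HTML Agility Pack.
public static class ImgSrc
{
    // First <img ...> tag that carries a src attribute; group 1 is the quote
    // character, group 2 is the attribute value between the matching quotes.
    private static readonly Regex FirstImgSrc = new Regex(
        @"<img\b[^>]*?\ssrc\s*=\s*([""'])(.*?)\1",
        RegexOptions.IgnoreCase);

    // Returns the src of the first <img> that has one, or an empty string
    // when the input is null or contains no such tag.
    public static string FirstSrc(string html)
    {
        Match m = FirstImgSrc.Match(html ?? string.Empty);
        return m.Success ? m.Groups[2].Value : string.Empty;
    }
}
```

Inside the loop this would be called as ImgSrc.FirstSrc(item.OuterHtml), which yields one value per item.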

Answered By – Mi Yahn

Answer Checked By – Clifford M. (AngularFixing Volunteer)
