How to get the first occurrence of src with HTML Agility Pack

Issue

Because the XML files I have are not validly formatted, I'm using HTML Agility Pack to parse them.
For example, I am parsing this feed: https://www.rioseo.com/feed/

I have an array of elements like this one (the "src" values are always unique):

<content:encoded><![CDATA[<h2><a href="https://resources.rioseo.com/c/gbp-guide-for-hospit?x=0hTW-s"><img class="alignnone size-full wp-image-23086" src="https://www.rioseo.com/wp-content/uploads/2022/04/Rio_eBook_GBP-Guide-for-Hospitality-Brands_April2022_Hero.jpg" alt="" width="1200" height="409" srcset="https://www.rioseo.com/wp-content/uploads/2022/04/Rio_eBook_GBP-Guide-for-Hospitality-Brands_April2022_Hero-200x68.jpg 200w, https://www.rioseo.com/wp-content/uploads/2022/04/Rio_eBook_GBP-Guide-for-Hospitality-Brands_April2022_Hero-300x102.jpg 300w, https://www.rioseo.com/wp-content/uploads/2022/04/Rio_eBook_GBP-Guide-for-Hospitality-Brands_April2022_Hero-400x136.jpg 400w,

I want to get only the first image URL, taken from the src attribute, so my expected output is an array of URLs:

{'https://www.rioseo.com/wp-content/uploads/2022/04/Rio_eBook_GBP-Guide-for-Hospitality-Brands_April2022_Hero.jpg',
https://another.url.extracted.from.the.array.of.'content_encoded'}

I can output the whole img element from the 'content:encoded' node with:

var images = doc.DocumentNode.SelectNodes(".//*[name()='content:encoded']/img").ToArray();
foreach (var item in images)
{
    Console.WriteLine("image: " + item.OuterHtml);
}

Properties other than OuterHtml give me blank output.

I can also output the src of every img in the document with:

var items = doc.DocumentNode.SelectNodes("//img[@src]").ToArray();
foreach (var image in items)
{
    Console.WriteLine("img: " + image.Attributes["src"].Value);
}

I know I have to extract the first occurrence of "https" from the img element.
I've tried many XPath expressions, but I can't get it to work. My XPath is probably wrong, but I don't know how to fix it.

Any help would be much appreciated :), thanks!

Solution

I think I got it. With a regex, I just do:

// Requires: using System.Text.RegularExpressions;
var items = doc.DocumentNode.SelectNodes(".//item").ToArray();
foreach (var item in items)
{
    // Group 1 captures the first src value that follows an <img tag.
    string matchString = Regex.Match(item.OuterHtml, "<img.+?src=[\"'](.+?)[\"'].*?>", RegexOptions.IgnoreCase).Groups[1].Value;
    Console.WriteLine("img: " + matchString);
}
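
The same idea can be sketched as a standalone helper that does not depend on HTML Agility Pack. This is only an illustration, not part of the original answer: the names ImageSrcExtractor and FirstSrc and the example.com URLs are made up, and the pattern is a slightly stricter variant of the one above that requires whitespace before src, so attributes such as srcset or data-src are skipped. Regex matching on HTML is fragile in general, so treat it as a sketch that assumes reasonably regular markup.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper (not from the original answer): returns the src of the first <img> tag.
public static class ImageSrcExtractor
{
    // [^>]*? stays inside the <img ...> tag; \s before "src" rejects srcset and data-src.
    private static readonly Regex ImgSrc = new Regex(
        "<img\\b[^>]*?\\ssrc\\s*=\\s*[\"']([^\"']+)[\"']",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Returns the first src value found, or an empty string when nothing matches.
    public static string FirstSrc(string html)
    {
        if (string.IsNullOrEmpty(html))
        {
            return string.Empty;
        }

        Match match = ImgSrc.Match(html);
        return match.Success ? match.Groups[1].Value : string.Empty;
    }

    public static void Main()
    {
        // Made-up sample markup shaped like the feed's content:encoded payload.
        string sample =
            "<h2><a href=\"https://example.com/page\">" +
            "<img class=\"alignnone\" src=\"https://example.com/hero.jpg\" " +
            "srcset=\"https://example.com/hero-200.jpg 200w\" /></a></h2>" +
            "<img src=\"https://example.com/second.jpg\" />";

        Console.WriteLine("img: " + FirstSrc(sample)); // prints "img: https://example.com/hero.jpg"
    }
}
```

Inside the loop above, this would be called as ImageSrcExtractor.FirstSrc(item.OuterHtml), which yields one URL per item.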

Answered By – Mi Yahn

Answer Checked By – Clifford M. (AngularFixing Volunteer)
