How can I convert HTML to Object structure with text and formatting?

Issue

I need to convert a HTML String with nested Tags like this one:

const strHTML = "<p>Hello World</p><p>I am a text with <strong>bold</strong> word</p><p><strong>I am bold text with nested <em>italic</em> Word.</strong></p>"

Into the following Array of objects with this Structure:

const result = [{
  text: "Hello World",
  format: null
}, {
  text: "I am a text with",
  format: null
}, {
  text: "bold",
  format: ["strong"]
}, {
  text: " word",
  format: null
}, {
  text: "I am a text with nested",
  format: ["strong"]
}, {
  text: "italic",
  format: ["strong", "em"]
}, {
  text: "Word.",
  format: ["strong"]
}];

I managed the conversion with the DOMParser() as long as there are no nested Tags. I am not able to get it running with nested Tags, like in the last paragraph, so my whole paragraph is bold, but the word "italic" should be both bold and italic. I cannot get it running as a recursion.

Any help would be appreciated.

So the code I wrote so far is this one:

export interface Phrase {
    text: string;
    format: string | string[];
}

export class HTMLParser {

    public parse(text: string): void {
        const parser = new DOMParser();
        const sourceDocument = parser.parseFromString(text, "text/html");
        this.parseChildren(sourceDocument.body.childNodes);

        // HERE SHOULD BE the result
        console.log("RESULT of CONVERSION", this.phrasesProcessed);
    }

    public phrasesProcessed: Phrase[] = [];

    private parseChildren(toParse: NodeListOf<ChildNode>) {
        this.phrasesProcessed = [];
        try {
            Array.from(toParse)
                .map(item => {
                    if (item.nodeType === Node.ELEMENT_NODE && item instanceof HTMLElement) {
                        return Array.from(item.childNodes).map(child => ({ text: child.textContent, format: (child.nodeType === Node.ELEMENT_NODE && child instanceof HTMLElement) ? child.tagName : null }));
                    } else {
                        return Array.from(item.childNodes).map(child => ({ text: child.textContent, format: null }));
                    }
                })
                .filter(line => line.length) // only non emtpy arrays
                .map(element => ([...element, { text: "\n", format: null }])) // add linebreak after each P
                .reduce((acc: (Phrase)[], val) => acc.concat(val), []) // flatten
                .forEach(
                    element => {
                        // console.log("ELEMENT", element);
                        this.phrasesProcessed.push(element);
                    }
                );
        } catch (e) {
            console.warn(e);
        }
    }

}

Solution

You can use recursion. And this seems a good case for a generator function. As it was not clear which tags should be retained in format (apparently, not p), I left this as a configuration to provide:

const formatTags = new Set(["b", "big", "code", "del", "em", "i", "pre", "s", "small", "strike", "strong", "sub", "sup", "u"]);

function* iterLeafNodes(nodes, format=[]) {
    for (let node of nodes) {
        if (node.nodeType == 3) {
            yield ({text: node.nodeValue, format: format.length ? [...format] : null});
        } else {
            const tag = node.tagName.toLowerCase();
            yield* iterLeafNodes(node.childNodes, 
                                 formatTags.has(tag) ? format.concat(tag) : format);
        }
    }
}

// Example input
const strHTML = "<p>Hello World</p><p>I am a text with <strong>bold</strong> word</p><p><strong>I am bold text with nested <em>italic</em> Word.</strong></p>"
const nodes = new DOMParser().parseFromString(strHTML, 'text/html').body.childNodes;
let result = [...iterLeafNodes(nodes)];

console.log(result);

Note that this will still split the text when it is spread over multiple tags, which are considered non-formatting tags, like span.

Secondly, I’m not convinced that having null as a possible value for format is more useful then just an empty array [], but anyway, the above produces null in that case.

Special case – insertion of \n

In comments you ask for the insertion of a line break after each p element.

The code below will generate that extra element. Here I also used [] instead of null for format:

const formatTags = new Set(["b", "big", "code", "del", "em", "i", "pre", "s", "small", "strike", "strong", "sub", "sup", "u"]);

function* iterLeafNodes(nodes, format=[]) {
    for (let node of nodes) {
        if (node.nodeType == 3) {
            yield ({text: node.nodeValue, format: [...format]});
        } else {
            const tag = node.tagName.toLowerCase();
            yield* iterLeafNodes(node.childNodes, 
                                 formatTags.has(tag) ? format.concat(tag) : format);
            if (tag === "p") yield ({text: "\n", format: [...format]});
        }
    }
}

// Example input
const strHTML = "<p>Hello World</p><p>I am a text with <strong>bold</strong> word</p><p><strong>I am bold text with nested <em>italic</em> Word.</strong></p>"
const nodes = new DOMParser().parseFromString(strHTML, 'text/html').body.childNodes;
let result = [...iterLeafNodes(nodes)];

console.log(result);

Answered By – trincot

Answer Checked By – Mary Flores (AngularFixing Volunteer)

Leave a Reply

Your email address will not be published.