How to remove HTML content but keep line breaks from <br/> tags? #576

ChrisMBarr · 2023-05-20T03:32:33Z

ChrisMBarr
May 20, 2023

I am working on something that will parse OpenLyrics files. Keep in mind that all I care about extracting are the song lyrics. I am running into some issues dealing with HTML tags that are sometimes used as line breaks.

These are my FXP options

{
  ignoreAttributes: false,
  attributeNamePrefix: '',
  parseAttributeValue: true,
  isArray: (_name, jPath) => jPath === 'verse.lines',
}

If I have the following XML, simplified for this example

<verse name="v1">
  <lines>Amazing grace how sweet the sound<br/>that saved a wretch like me</lines>
</verse>

It produces this result:

"verse": {
  "lines": [
    {
      "br": "",
      "#text": "Amazing grace how sweet the soundthat saved a wretch like me;"
    }
  ],
  "name": "v1"
}

I love that I just get the text back; however, the <br/> tag is removed from the text and no line break is left in its place, so the text gets smushed together. I then found the stopNodes option, which I added like so: stopNodes: ['verse.lines']. That works great for this exact case, producing this:

"#text": "Amazing grace how sweet the sound<br/>that saved a wretch like me;"

I can easily work around that and replace <br/> with real line breaks later. However, another example file in this format has the following:

<verse name="v2" lang="en-US">
  <lines part="men"><comment>any text</comment><br/>Amazing grace how sweet the sound that saved a wretch like me;<br/><comment>any text</comment><br/><chord name="D"/>Amazing grace how sweet <chord name="D"/>the sound that saved a wretch like me;<chord name="B7"/><br/>Amazing grace<chord name="G7"/> how sweet the sound that saved a wretch like me;</lines>
  <lines part="women">A b c<br/><br/>D e f</lines>
</verse>

Which produces this:

 "verse": {
    "lines": [
      {
        "#text": "<comment>any text</comment><br/>Amazing grace how sweet the sound that saved a wretch like me;<br/><comment>any text</comment><br/><chord name=\"D\"/>Amazing grace how sweet <chord name=\"D\"/>the sound that saved a wretch like me;<chord name=\"B7\"/><br/>Amazing grace<chord name=\"G7\"/> how sweet the sound that saved a wretch like me;",
        "part": "men"
      },
      {
        "#text": "A b c<br/><br/>D e f",
        "part": "women"
      }
  ],
  "name": "v2",
  "lang": "en-US"
},

So, unless I am missing something here, it appears that I will need to leave these options as they are and then figure out my own way of removing all these other tags from the #text string, and replace <br/> with \n. I know that tagValueProcessor exists; I tried it and it doesn't really suit my needs. Is there a better way to achieve what I need?

Answered by ChrisMBarr

ChrisMBarr · 2023-05-24T13:31:50Z

ChrisMBarr
May 24, 2023
Author

As a follow up I wanted to share my solution. Please advise if there is a better way to handle this.

I used these options:

{
  ignoreAttributes: false,
  ignoreDeclaration: true,
  attributeNamePrefix: '',
  parseAttributeValue: true,
  stopNodes: ['verse.lines'],
  tagValueProcessor: (_tagName, tagValue, jPath): string | null => {
    return jPath === 'verse.lines' ? tagValue : null;
  },
}

And after parsing, that will now always return a string of text and HTML/XML nodes inside of each <lines> node, which I process with the below method.

private convertHtmlLineBreaksAndStripTags(str: string): string {
  return (
    str
      //replace correctly and incorrectly formatted <br>, </br> and <br/> tags with new lines
      //Sometimes these will already have a newline after them, remove that so that newlines aren't doubled
      .replace(/<\/?br\/?>(\n)?/gi, '\n')
      //Then remove all remaining HTML/XML tags and their content
      .replace(/(<[^/]+?>.+?<\/.+?>)|(<[^/]+?\/>)/g, '')
  );
}

So, that works for me for now at least.
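
For anyone who wants to try the post-processing step without the parser, here is a minimal standalone sketch of the same idea. The function name linesToText is hypothetical, and the regexes are a variation on the ones above rather than the exact originals: they also accept `<br />` with a space, and they let the content of a removed element span several lines. It assumes elements such as `<comment>` are not nested inside each other.

```typescript
// Hypothetical helper (name and regexes are illustrative, not part of fast-xml-parser).
function linesToText(raw: string): string {
  return raw
    // <br>, <br/>, <br /> and </br> become a newline; swallow one newline that already follows
    .replace(/<\/?br\s*\/?>\r?\n?/gi, "\n")
    // drop self-closing tags such as <chord name="D"/>
    .replace(/<[^<>]*\/>/g, "")
    // drop paired elements together with their content, such as <comment>any text</comment>
    .replace(/<([A-Za-z][\w:-]*)[^<>]*>[\s\S]*?<\/\1\s*>/g, "");
}

const women = linesToText("A b c<br/><br/>D e f");
console.log(JSON.stringify(women)); // "A b c\n\nD e f"
```

Removing a `<comment>` that sits directly before a `<br/>` leaves an empty line behind, so a caller may still want to trim or collapse blank lines afterwards.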

amitguptagwl · 2023-05-25T05:08:15Z

amitguptagwl
May 25, 2023
Maintainer

You can try updateTag option.

ChrisMBarr · 2023-05-26T02:54:10Z

ChrisMBarr
May 26, 2023
Author

Well, that seems to work better in theory, however it has some side effects I wasn't expecting. Given this XML

<verse name="v2" lang="en-US">
  <lines part="men"><comment>any text</comment><br/>Amazing grace how sweet the sound that saved a wretch like me;<br/><comment>any text</comment><br/><chord name="D"/>Amazing grace how sweet <chord name="D"/>the sound that saved a wretch like me;<chord name="B7"/><br/>Amazing grace<chord name="G7"/> how sweet the sound that saved a wretch like me;</lines>
  <lines part="women">A b c<br/><br/>D e f</lines>
</verse>

with these settings

{
  ignoreAttributes: false,
  ignoreDeclaration: true,
  attributeNamePrefix: '',
  parseAttributeValue: true,
  updateTag: (tagName, jPath): string | boolean => {
    if (jPath === 'verse.lines.chord' || jPath === 'verse.lines.comment') {
      return false;
    }
    return true;
  },
}

It produces this object:

[
    {
        "br": [ "", "", "", "" ],
        "#text": "Amazing grace how sweet the sound that saved a wretch like me;Amazing grace how sweetthe sound that saved a wretch like me;Amazing gracehow sweet the sound that saved a wretch like me;",
        "part": "men"
    },
    {
        "br": [ "", "" ],
        "#text": "A b cD e f",
        "part": "women"
    }
]

As you can see, it skips the <br/> tags and smushes the text together, and then it gives an array of empty strings for a "br" property.
I would have assumed it would just totally skip the nodes I was explicit about and leave the <br/> nodes alone. I think I'll have to stick with my original way of working with this.

amitguptagwl · 2023-05-27T00:56:04Z

amitguptagwl
May 27, 2023
Maintainer

Try return;, instead of return false;

2 replies

ChrisMBarr May 27, 2023
Author

Not possible in a TypeScript project with the current typedefs for that method, which say the updateTag method can only have a string | boolean return type. If what you are saying is true, that needs to be string | boolean | undefined.
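
To illustrate the typing point only, here is a small sketch using local stand-in types. UpdateTagStrict and UpdateTagWidened are hypothetical names that mirror the return types mentioned above; they are not imported from the library, and the sketch says nothing about what the parser does with an undefined result.

```typescript
// Local stand-ins for the callback type; these are assumptions, not fast-xml-parser's own typedefs.
type UpdateTagStrict = (tagName: string, jPath: string) => string | boolean;
type UpdateTagWidened = (tagName: string, jPath: string) => string | boolean | undefined;

// Accepted under the widened type. Annotating this same function as UpdateTagStrict
// is rejected when strictNullChecks is on, because the bare `return;` yields undefined.
const updateTag: UpdateTagWidened = (_tagName, jPath) => {
  if (jPath === "verse.lines.chord" || jPath === "verse.lines.comment") {
    return; // undefined
  }
  return true;
};

console.log(updateTag("chord", "verse.lines.chord")); // undefined
console.log(updateTag("br", "verse.lines.br")); // true
```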

ChrisMBarr May 28, 2023
Author

I have made a PR to change the typedefs
#579

amitguptagwl · 2023-05-28T15:29:17Z

amitguptagwl
May 28, 2023
Maintainer

Thanks. But I checked and it doesn't need a fix. I'm working on a major version change. I'll test this point there.

1 reply

ChrisMBarr May 28, 2023
Author

ah ok, even better! Be sure to check that PR where I give the example output. What it produces feels wrong to me, but maybe I'm missing something.
