How to remove HTML content but keep line breaks from <br/> tags? #576
-
I am working on something that will parse OpenLyrics files. All I care about extracting is the song lyrics. I am running into some issues dealing with HTML tags that are sometimes used as line breaks. These are my FXP (fast-xml-parser) options:
{
ignoreAttributes: false,
attributeNamePrefix: '',
parseAttributeValue: true,
isArray: (_name, jPath) => jPath === 'verse.lines',
}
If I have the following XML, simplified for this example:
<verse name="v1">
<lines>Amazing grace how sweet the sound<br/>that saved a wretch like me</lines>
</verse>
It produces this result:
"verse": {
"lines": [
{
"br": "",
"#text": "Amazing grace how sweet the soundthat saved a wretch like me;"
}
],
"name": "v1"
}
I love that I just get the text back, however that
I can easily work around that and replace with real line breaks later. However, another example file in this format has the following:
<verse name="v2" lang="en-US">
<lines part="men"><comment>any text</comment><br/>Amazing grace how sweet the sound that saved a wretch like me;<br/><comment>any text</comment><br/><chord name="D"/>Amazing grace how sweet <chord name="D"/>the sound that saved a wretch like me;<chord name="B7"/><br/>Amazing grace<chord name="G7"/> how sweet the sound that saved a wretch like me;</lines>
<lines part="women">A b c<br/><br/>D e f</lines>
</verse>
Which produces this:
"verse": {
"lines": [
{
"#text": "<comment>any text</comment><br/>Amazing grace how sweet the sound that saved a wretch like me;<br/><comment>any text</comment><br/><chord name=\"D\"/>Amazing grace how sweet <chord name=\"D\"/>the sound that saved a wretch like me;<chord name=\"B7\"/><br/>Amazing grace<chord name=\"G7\"/> how sweet the sound that saved a wretch like me;",
"part": "men"
},
{
"#text": "A b c<br/><br/>D e f",
"part": "women"
}
],
"name": "v2",
"lang": "en-US"
},
So, unless I am missing something here, it appears as if I will need to leave these options as they are and then figure out my own way of removing all these other tags from the
Replies: 5 comments 3 replies
-
As a follow-up, I wanted to share my solution. Please advise if there is a better way to handle this. I used these options:
{
ignoreAttributes: false,
ignoreDeclaration: true,
attributeNamePrefix: '',
parseAttributeValue: true,
stopNodes: ['verse.lines'],
tagValueProcessor: (_tagName, tagValue, jPath): string | null => {
return jPath === 'verse.lines' ? tagValue : null;
},
}
And after parsing, that will now always return a string of text and HTML/XML nodes inside of each <lines> node, which I process with the below method.
private convertHtmlLineBreaksAndStripTags(str: string): string {
return (
str
//replace correctly and incorrectly formatted <br>, <br/>, <br /> and </br> tags with new lines
//Sometimes these will already have a newline after them, remove that so that newlines aren't doubled
.replace(/<\/?br\s*\/?>(\n)?/gi, '\n')
//Then remove all remaining HTML/XML tags and their content
//([^/>] keeps a match from running past the end of the opening tag)
.replace(/(<[^/>]+?>.+?<\/.+?>)|(<[^/>]+?\/>)/g, '')
);
}
So, that works for me for now at least.
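For anyone who wants to try the post-processing step without the parser, here is a minimal standalone sketch of the same idea. The function name `stripTagsKeepBreaks` is hypothetical (it is not part of fast-xml-parser), and it assumes the input is the raw inner content of a `<lines>` node, as returned when `stopNodes` is used. It is still regex-based, so it does not handle nested elements with the same name or a `>` inside an attribute value.

```typescript
// Hypothetical helper, not part of fast-xml-parser.
// Input: the raw inner content of one <lines> node.
function stripTagsKeepBreaks(raw: string): string {
  return (
    raw
      // <br>, <br/>, <br /> and </br>, plus one optional trailing newline -> "\n"
      .replace(/<\/?br\s*\/?>\r?\n?/gi, "\n")
      // self-closing elements such as <chord name="D"/>
      .replace(/<[^<>]*\/>/g, "")
      // paired elements together with their content,
      // e.g. <comment>any text</comment>
      .replace(/<([A-Za-z][\w:.-]*)(?=[\s>])[^<>]*>[\s\S]*?<\/\1\s*>/g, "")
  );
}

// Usage with the "women" part from the example above.
const womenLyrics = stripTagsKeepBreaks("A b c<br/><br/>D e f");
console.log(JSON.stringify(womenLyrics));
```

Removing self-closing elements before paired ones keeps the paired-element pattern simple, since it no longer has to tell `<chord .../>` apart from an opening tag.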
-
You can try
-
Well, that seems to work better in theory; however, it has some side effects I wasn't expecting. Given this XML:
<verse name="v2" lang="en-US">
<lines part="men"><comment>any text</comment><br/>Amazing grace how sweet the sound that saved a wretch like me;<br/><comment>any text</comment><br/><chord name="D"/>Amazing grace how sweet <chord name="D"/>the sound that saved a wretch like me;<chord name="B7"/><br/>Amazing grace<chord name="G7"/> how sweet the sound that saved a wretch like me;</lines>
<lines part="women">A b c<br/><br/>D e f</lines>
</verse>
with these settings:
{
ignoreAttributes: false,
ignoreDeclaration: true,
attributeNamePrefix: '',
parseAttributeValue: true,
updateTag: (tagName, jPath): string | boolean => {
if (jPath === 'verse.lines.chord' || jPath === 'verse.lines.comment') {
return false;
}
return true;
},
}
It produces this object:
[
{
"br": [ "", "", "", "" ],
"#text": "Amazing grace how sweet the sound that saved a wretch like me;Amazing grace how sweetthe sound that saved a wretch like me;Amazing gracehow sweet the sound that saved a wretch like me;",
"part": "men"
},
{
"br": [ "", "" ],
"#text": "A b cD e f",
"part": "women"
}
]
As you can see it skips the
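The output above shows why the position of each `<br/>` is lost: the `br` entries and the `#text` string are collected separately. One possible alternative, offered only as a sketch, is to parse with fast-xml-parser's `preserveOrder: true` option and walk the ordered children. The helper `linesToText` and the `OrderedNode` type are hypothetical names, and the input literal below is hand-written on the assumption that the ordered output is an array of single-key objects, with text under `#text` and attributes under `:@`. Text values may also be trimmed unless `trimValues: false` is set, which would explain the joined words such as "sweetthe" above.

```typescript
// Assumed shape of fast-xml-parser output with `preserveOrder: true`:
// each node has one tag key (holding an array of children) or "#text",
// plus an optional ":@" key holding attributes.
type OrderedNode = { [key: string]: unknown };

// Hypothetical helper: rebuild the lyric text in document order, turning
// <br/> into "\n" and dropping the listed tags together with their content.
function linesToText(
  children: OrderedNode[],
  skipTags: ReadonlySet<string> = new Set(["chord", "comment"]),
): string {
  let out = "";
  for (const node of children) {
    for (const [key, value] of Object.entries(node)) {
      if (key === ":@") continue; // attributes, not content
      if (key === "#text") {
        out += String(value);
      } else if (key === "br") {
        out += "\n";
      } else if (!skipTags.has(key) && Array.isArray(value)) {
        out += linesToText(value as OrderedNode[], skipTags);
      }
    }
  }
  return out;
}

// Hand-written stand-in for the children of <lines part="women">.
const womenChildren: OrderedNode[] = [
  { "#text": "A b c" },
  { br: [] },
  { br: [] },
  { "#text": "D e f" },
];
console.log(JSON.stringify(linesToText(womenChildren)));
```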
-
Try
-
Thanks. But I checked and it doesn't need a fix. I'm working on a major version change. I'll test this point there.