Supports HTML Doc parsing
amitguptagwl committed Dec 1, 2021
1 parent 7439272 commit 6f10c26
Showing 9 changed files with 327 additions and 18 deletions.
3 changes: 2 additions & 1 deletion README.md
@@ -29,6 +29,7 @@ Check [ThankYouBackers](https://github.com/NaturalIntelligence/ThankYouBackers)
<a href="http://nasa.github.io/" title="NASA" > <img src="https://avatars0.githubusercontent.com/u/848102" width="60px" ></a>
<a href="https://github.com/prettier" title="Prettier" > <img src="https://avatars0.githubusercontent.com/u/25822731" width="60px" ></a>
<a href="http://brain.js.org/" title="brain.js" > <img src="https://avatars2.githubusercontent.com/u/23732838" width="60px" ></a>
<a href="https://github.com/aws" title="AWS SDK" > <img src="https://avatars.githubusercontent.com/u/2232217" width="60px" ></a>
<a href="#" title="NHS Connect" > <img src="https://avatars3.githubusercontent.com/u/20316669" width="60px" ></a>
<a href="http://www.fda.gov/" title="Food and Drug Administration " > <img src="https://avatars2.githubusercontent.com/u/6471964" width="60px" ></a>
<a href="http://www.magento.com/" title="Magento" > <img src="https://avatars2.githubusercontent.com/u/168457" width="60px" ></a>
@@ -106,7 +107,7 @@ In a HTML page
3. [XML Builder](./docs/v4/3.XMLBuilder.md)
4. [XML Validator](./docs/v4/4.XMLValidator.md)
5. [Entities](./docs/5.Entities.md)

6. [HTML Document Parsing](./docs/6.HTMLParsing.md)
## Performance

### XML Parser
5 changes: 5 additions & 0 deletions docs/v4/3.XMLBuilder.md
@@ -79,6 +79,11 @@ When you parse a XML using XMLParser with `preserveOrder: true`, the result JS o
## processEntities
Set it to `true` (default) to process XML entities. See the [Entities](./5.Entities.md) section for more detail. If your XML document has no entities, it is recommended to disable this with `processEntities: false` for better performance.

## stopNodes
Just as you set `stopNodes` in the XML parser configuration to skip parsing and processing of a tag's content, you can set it in the builder to skip parsing and entity processing for the same tags. Check [HTML Document Parsing](./6.HTMLParsing.md) for more detail.

This property is currently supported only with the `preserveOrder: true` option.
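
To make the `stopNodes` expressions concrete, here is a minimal sketch of how an expression such as `*.pre` could be matched against a tag's path. The `isStopNode` helper, the dot-separated path format, and the exact-path form `html.head.style` are assumptions made for illustration; this is not FXP's actual implementation.

```javascript
// Hypothetical helper, for illustration only (not part of FXP's API).
// `tagPath` is assumed to be a dot-separated path such as "html.body.pre".
function isStopNode(stopNodes, tagPath) {
  const tagName = tagPath.split(".").pop();
  return stopNodes.some(
    // "*.pre" matches a <pre> tag at any depth; otherwise the full path must match.
    (expr) => expr === "*." + tagName || expr === tagPath
  );
}

const stopNodes = ["*.pre", "*.script", "html.head.style"];

console.log(isStopNode(stopNodes, "html.body.pre"));   // true
console.log(isStopNode(stopNodes, "html.head.style")); // true
console.log(isStopNode(stopNodes, "html.body.h1"));    // false
```

Whatever the real matching rules are, the content of a matched tag is kept as raw text rather than parsed into child nodes.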

## suppressEmptyNode
Tags with no text value would be parsed as empty tags.
Input
4 changes: 3 additions & 1 deletion docs/v4/5.Entities.md
@@ -132,4 +132,6 @@ Following HTML entities are supported by the parser by default when `htmlEntitie
|| Indian Rupee | &inr; | &#8377; |
---

In a future version of FXP, we'll support more DOCTYPE features such as `ELEMENT`, reading the content of an entity from a file, etc.

[> Next: HTML Document Parsing](./6.HTMLParsing.md)
204 changes: 204 additions & 0 deletions docs/v4/6.HTMLParsing.md
@@ -0,0 +1,204 @@
FXP supports parsing HTML documents. Here is an example:

Input HTML Document
```html
<!DOCTYPE html>
<html lang="en">
<head>
<script>
window.dataLayer = window.dataLayer || [];
function gtag(){dataLayer.push(arguments);}
gtag('js', new Date());
gtag('config', 'UA-80202630-2');
</script>

<title>Fast XMl Parser</title>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1">
<link rel="stylesheet" href="static/css/bootstrap.min.css">
<link rel="stylesheet" href="static/css/jquery-confirm.min.css">
<link rel="stylesheet" type="text/css" href="style.css">

<script src="static/js/jquery-3.2.1.min.js"></script>
<style>
.CodeMirror{
height: 100%;
width: 100%;
}
</style>
</head>
<body role="document" style="background-color: #2c3e50;">
<h1>Heading</h1>
<hr>
<h2>&inr;</h2>
<pre>
<h1>Heading</h1>
<hr>
<h2>&inr;</h2>
</pre>
<script>
let highlightedLine = null;
let editor;
<!-- this should not be parsed separately -->
function updateLength(){
const xmlData = editor.getValue();
$("#lengthxml")[0].innerText = xmlData.replace(/>\s*</g, "><").length;
}
</script>
</body>
</html>
```

Code and the configuration needed to parse it into a JS object:

```js
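const { XMLParser } = require("fast-xml-parser");

// `html` holds the input HTML document shown above, as a string.
const parsingOptions = {
  ignoreAttributes: false,
  // preserveOrder: true,
  unpairedTags: ["hr", "br", "link", "meta"], // tags written without a closing tag
  stopNodes: ["*.pre", "*.script"], // content of these tags is kept as raw text
  processEntities: true,
  htmlEntities: true
};
const parser = new XMLParser(parsingOptions);
const result = parser.parse(html);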
```

JS Object
```json
{
"html": {
"head": {
"script": [
"\n window.dataLayer = window.dataLayer || [];\n function gtag(){dataLayer.push(arguments);}\n gtag('js', new Date());\n \n gtag('config', 'UA-80202630-2');\n ",
{
"@_src": "static/js/jquery-3.2.1.min.js"
}
],
"title": "Fast XMl Parser",
"meta": [
{
"@_charset": "UTF-8"
},
{
"@_name": "viewport",
"@_content": "width=device-width, initial-scale=1"
}
],
"link": [
{
"@_rel": "stylesheet",
"@_href": "static/css/bootstrap.min.css"
},
{
"@_rel": "stylesheet",
"@_href": "static/css/jquery-confirm.min.css"
},
{
"@_rel": "stylesheet",
"@_type": "text/css",
"@_href": "style.css"
}
],
"style": ".CodeMirror{\n height: 100%;\n width: 100%;\n }"
},
"body": {
"h1": "Heading",
"hr": "",
"h2": "",
"pre": "\n <h1>Heading</h1>\n <hr>\n <h2>&inr;</h2>\n ",
"script": "\n let highlightedLine = null;\n let editor;\n <!-- this should not be parsed separately -->\n function updateLength(){\n const xmlData = editor.getValue();\n $(\"#lengthxml\")[0].innerText = xmlData.replace(/>\\s*</g, \"><\").length;\n }\n ",
"@_role": "document",
"@_style": "background-color: #2c3e50;"
},
"@_lang": "en"
}
}
```
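
Reading values out of this object needs only plain property access. The sketch below works on a hand-copied excerpt of the result above, so it runs without FXP; the variable names are illustrative. As the output shows, attributes appear under keys with the `@_` prefix, and a tag that occurs more than once, such as `link`, becomes an array.

```javascript
// Excerpt of the parsed result shown above, copied by hand for illustration.
const parsed = {
  html: {
    head: {
      title: "Fast XMl Parser",
      link: [
        { "@_rel": "stylesheet", "@_href": "static/css/bootstrap.min.css" },
        { "@_rel": "stylesheet", "@_href": "static/css/jquery-confirm.min.css" },
        { "@_rel": "stylesheet", "@_type": "text/css", "@_href": "style.css" }
      ]
    },
    body: { h1: "Heading", "@_role": "document" },
    "@_lang": "en"
  }
};

// Attribute keys contain "@", so they need bracket notation.
const lang = parsed.html["@_lang"];

// Repeated tags are arrays, so the usual array methods apply.
const stylesheets = parsed.html.head.link.map((link) => link["@_href"]);

console.log(lang);               // "en"
console.log(stylesheets.length); // 3
console.log(stylesheets[2]);     // "style.css"
```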

To build the HTML document back from the JS object, uncomment `preserveOrder: true` in the code above and pass the parser's output to the XML builder:
```js
const { XMLParser, XMLBuilder } = require("fast-xml-parser");

// `preserveOrder: true` is needed to build the document back from the result.
const parsingOptions = {
  ignoreAttributes: false,
  preserveOrder: true,
  unpairedTags: ["hr", "br", "link", "meta"],
  stopNodes: ["*.pre", "*.script"],
  processEntities: true,
  htmlEntities: true
};
const parser = new XMLParser(parsingOptions);
const result = parser.parse(html); // `html` is the input document shown above

// The builder is given the same `unpairedTags` and `stopNodes` as the parser.
const builderOptions = {
  ignoreAttributes: false,
  format: true,
  preserveOrder: true,
  suppressEmptyNode: true,
  unpairedTags: ["hr", "br", "link", "meta"],
  stopNodes: ["*.pre", "*.script"]
};
const builder = new XMLBuilder(builderOptions);
const output = builder.build(result);
```

Output
```html
<html lang="en">
<head>
<script>
window.dataLayer = window.dataLayer || [];
function gtag(){dataLayer.push(arguments);}
gtag('js', new Date());
gtag('config', 'UA-80202630-2');
</script>
<title>
Fast XMl Parser
</title>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1">
<link rel="stylesheet" href="static/css/bootstrap.min.css">
<link rel="stylesheet" href="static/css/jquery-confirm.min.css">
<link rel="stylesheet" type="text/css" href="style.css">
<script src="static/js/jquery-3.2.1.min.js">
</script>
<style>
.CodeMirror{
height: 100%;
width: 100%;
}
</style>
</head>
<body role="document" style="background-color: #2c3e50;">
<h1>
Heading
</h1>
<hr>
<h2>
</h2>
<pre>

<h1>Heading</h1>
<hr>
<h2>&inr;</h2>

</pre>
<script>
let highlightedLine = null;
let editor;
<!-- this should not be parsed separately -->
function updateLength(){
const xmlData = editor.getValue();
$("#lengthxml")[0].innerText = xmlData.replace(/>\s*</g, "><").length;
}
</script>
</body>
</html>
```
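
Because `format: true` re-indents the output, a round trip is easiest to check by comparing input and output with all whitespace removed, which is what the spec added in this commit does. A minimal sketch on two short strings follows; `sameIgnoringWhitespace` is a hypothetical helper name, not part of FXP.

```javascript
// Hypothetical helper: compares two markup strings after removing all whitespace.
function sameIgnoringWhitespace(a, b) {
  return a.replace(/\s+/g, "") === b.replace(/\s+/g, "");
}

const original = "<h1>Heading</h1><hr>";
const rebuilt = "<h1>\n  Heading\n</h1>\n<hr>";

console.log(sameIgnoringWhitespace(original, rebuilt));              // true
console.log(sameIgnoringWhitespace(original, "<h1>Other</h1><hr>")); // false
```

This is a loose check: it also ignores whitespace inside text and inside stop nodes such as `<pre>`, so it confirms structure and content but not exact formatting.
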
19 changes: 9 additions & 10 deletions nexttodo.md
@@ -1,28 +1,27 @@
P0
* OptionsBuilder: replace by Object.assign
* update README for main features
* Support setting entities externally as option configuration
* &#xD; : https://github.com/NaturalIntelligence/fast-xml-parser/issues/342
* Write UT for nested stop node
* support stop nodes expression like head.*.meta

P1
* special characters such as '&amp;'
https://github.com/NaturalIntelligence/fast-xml-parser/issues/297
https://github.com/NaturalIntelligence/fast-xml-parser/issues/343
https://github.com/NaturalIntelligence/fast-xml-parser/issues/342
* skip parsing of particular tag
* doctype support
* Es6 modules
* Parse JSON string to XML. Currently it transforms JSON object to XML. Partially done. Need to work on performance.
* boolean tag to support HTML parsing
* https://github.com/NaturalIntelligence/fast-xml-parser/issues/206

P2
* Multiple roots
* skip parsing of after some tag
* validate XML stream data
* Parse JSON string to XML. Currently it transforms JSON object to XML. Partially done. Need to work on performance.
* Accept streams, arrayBuffer
https://github.com/NaturalIntelligence/fast-xml-parser/issues/347
* XML to JSON ML : https://en.wikipedia.org/wiki/JsonML






----

Entities
80 changes: 80 additions & 0 deletions spec/html_spec.js
@@ -0,0 +1,80 @@

const {XMLParser, XMLBuilder} = require("../src/fxp");

describe("XMLParser", function() {

it("should parse HTML with basic entities, <pre>, <script>, <br>", function() {
const html = `
<html lang="en">
<head>
<script>
window.dataLayer = window.dataLayer || [];
function gtag(){dataLayer.push(arguments);}
gtag('js', new Date());
gtag('config', 'UA-80202630-2');
</script>
<title>Fast XMl Parser</title>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1">
<link rel="stylesheet" href="static/css/bootstrap.min.css">
<link rel="stylesheet" href="static/css/jquery-confirm.min.css">
<link rel="stylesheet" type="text/css" href="style.css">
<script src="static/js/jquery-3.2.1.min.js"></script>
<style>
.CodeMirror{
height: 100%;
width: 100%;
}
</style>
</head>
<body role="document" style="background-color: #2c3e50;">
<h1>Heading</h1>
<hr>
<h2>&inr;</h2>
<pre>
<h1>Heading</h1>
<hr>
<h2>&inr;</h2>
</pre>
<script>
let highlightedLine = null;
let editor;
<!-- this should not be parsed separately -->
function updateLength(){
const xmlData = editor.getValue();
$("#lengthxml")[0].innerText = xmlData.replace(/>\s*</g, "><").length;
}
</script>
</body>
</html>`;

const parsingOptions = {
ignoreAttributes: false,
preserveOrder: true,
unpairedTags: ["hr", "br", "link", "meta"],
stopNodes : [ "*.pre", "*.script"],
processEntities: true,
htmlEntities: true
};
const parser = new XMLParser(parsingOptions);
let result = parser.parse(html);
// console.log(JSON.stringify(result, null,4));

const builderOptions = {
ignoreAttributes: false,
format: true,
preserveOrder: true,
suppressEmptyNode: true,
unpairedTags: ["hr", "br", "link", "meta"],
stopNodes : [ "*.pre", "*.script"],
}
const builder = new XMLBuilder(builderOptions);
let output = builder.build(result);
// console.log(output);
output = output.replace('₹','&inr;');
expect(output.replace(/\s+/g, "")).toEqual(html.replace(/\s+/g, ""));
});
});
1 change: 1 addition & 0 deletions src/fxp.d.ts
@@ -46,6 +46,7 @@ type XmlBuilderOptions = {
suppressEmptyNode: boolean;
preserveOrder: boolean;
unpairedTags: string[];
stopNodes: string[];
tagValueProcessor: (name: string, value: string) => string;
attributeValueProcessor: (name: string, value: string) => string;
processEntities: boolean;
4 changes: 3 additions & 1 deletion src/xmlbuilder/json2xml.js
@@ -27,7 +27,8 @@ const defaultOptions = {
"sQuot" : { regex: new RegExp("\'", "g"), val: "&apos;" },
"dQuot" : { regex: new RegExp("\"", "g"), val: "&quot;" }
},
processEntities: true
processEntities: true,
stopNodes: []
};

const props = [
@@ -47,6 +48,7 @@ const props = [
"unpairedTags",
"entities",
"processEntities",
"stopNodes",
// 'rootNodeName', //when jsObject have multiple properties on root level
];
