HtmlSharp is a C# library for parsing HTML and querying the resulting DOM with CSS selectors.
Let’s say you have the following HTML stored in a string named html.

    <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/loose.dtd">
    <html>
      <head>
        <title>Sample HTML</title>
      </head>
      <body>
        <p class="important">Make sure you read this part.</p>
        <p>Some info about stuff.</p>
        <p align="left" id="copyright">Copyright data.</p>
      </body>
    </html>

To be able to query this document, we’ll need to parse it:

    HtmlParser parser = new HtmlParser();
    HtmlDocument doc = parser.Parse(html);

From there, we can use CSS selectors to find the information we want from the document.
Finding the tag with the id of copyright:

    Tag t = doc.Find("#copyright");

Automatic casting to a typed element, so attributes can be read as statically typed properties:
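
    P paragraph = doc.Find<P>("#copyright");
    string align = paragraph.Align;    // "left"
    // "class" is a reserved word in C#, so the variable needs a different name
    string cssClass = paragraph.Class; // null (this paragraph has no class attribute)
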
Finding multiple tags:

    // Returns the 3 paragraph objects it finds
    IEnumerable<Tag> paragraphs = doc.FindAll("p");

Pattern | Meaning |
---|---|
* | any element |
E | an element of type E |
E[foo] | an E element with a “foo” attribute |
E[foo="bar"] | an E element whose “foo” attribute value is exactly equal to “bar” |
E[foo~="bar"] | an E element whose “foo” attribute value is a list of space-separated values, one of which is exactly equal to “bar” |
E[foo^="bar"] | an E element whose “foo” attribute value begins exactly with the string “bar” |
E[foo$="bar"] | an E element whose “foo” attribute value ends exactly with the string “bar” |
E[foo*="bar"] | an E element whose “foo” attribute value contains the substring “bar” |
E[hreflang\|="en"] | an E element whose “hreflang” attribute has a hyphen-separated list of values beginning (from the left) with “en” |
E:root | an E element, root of the document |
E:nth-child(n) | an E element, the n-th child of its parent |
E:nth-last-child(n) | an E element, the n-th child of its parent, counting from the last one |
E:nth-of-type(n) | an E element, the n-th sibling of its type |
E:nth-last-of-type(n) | an E element, the n-th sibling of its type, counting from the last one |
E:first-child | an E element, first child of its parent |
E:last-child | an E element, last child of its parent |
E:first-of-type | an E element, first sibling of its type |
E:last-of-type | an E element, last sibling of its type |
E:only-child | an E element, only child of its parent |
E:only-of-type | an E element, only sibling of its type |
E:empty | an E element that has no children (including text nodes) |
E:lang(fr) | an element of type E in language “fr” (the document language specifies how language is determined) |
E:disabled | a user interface element E which is disabled |
E:enabled | a user interface element E which is enabled |
E:checked | a user interface element E which is checked (for instance a radio-button or checkbox) |
E.warning | an E element whose class is “warning” (the document language specifies how class is determined) |
E#myid | an E element with ID equal to “myid” |
E:not(s) | an E element that does not match simple selector s |
E F | an F element descendant of an E element |
E > F | an F element child of an E element |
E + F | an F element immediately preceded by an E element |
E ~ F | an F element preceded by an E element |
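
The attribute operators in the table are easy to mix up, so here is a minimal sketch of how each one compares an attribute value, following the definitions in the CSS Selectors specification. It does not show how HtmlSharp implements matching internally: `AttributeMatch.Matches` is a hypothetical helper written only for this illustration, and it uses nothing beyond the .NET base class library.

```csharp
using System;
using System.Linq;

static class AttributeMatch
{
    // Hypothetical helper, not part of HtmlSharp.
    // value:    the attribute's value, or null when the attribute is absent
    // op:       "" for [foo], otherwise one of = ~= ^= $= *= |=
    // expected: the string written inside the selector
    public static bool Matches(string value, string op, string expected)
    {
        if (value == null) return false; // every form requires the attribute to exist

        switch (op)
        {
            case "":   // [foo]: presence is enough
                return true;
            case "=":  // exact match
                return value == expected;
            case "~=": // one of the whitespace-separated words equals expected
                return expected.Length > 0
                    && !expected.Any(c => char.IsWhiteSpace(c))
                    && value.Split(new[] { ' ', '\t', '\n', '\r', '\f' },
                                   StringSplitOptions.RemoveEmptyEntries)
                            .Contains(expected);
            case "^=": // prefix match; an empty string matches nothing
                return expected.Length > 0
                    && value.StartsWith(expected, StringComparison.Ordinal);
            case "$=": // suffix match; an empty string matches nothing
                return expected.Length > 0
                    && value.EndsWith(expected, StringComparison.Ordinal);
            case "*=": // substring match; an empty string matches nothing
                return expected.Length > 0 && value.Contains(expected);
            case "|=": // exactly expected, or expected followed by a hyphen
                return value == expected
                    || value.StartsWith(expected + "-", StringComparison.Ordinal);
            default:
                throw new ArgumentException("Unknown attribute operator: " + op);
        }
    }
}

static class Program
{
    static void Main()
    {
        Console.WriteLine(AttributeMatch.Matches("note important", "~=", "important")); // True
        Console.WriteLine(AttributeMatch.Matches("en-US", "|=", "en"));                 // True
        Console.WriteLine(AttributeMatch.Matches("english", "|=", "en"));               // False
        Console.WriteLine(AttributeMatch.Matches("left", "^=", "le"));                  // True
        Console.WriteLine(AttributeMatch.Matches(null, "", ""));                        // False
    }
}
```

The `|=` case is the one that most often surprises: it matches `en` and `en-US` but not `english`, because the prefix must be followed by a hyphen or be the whole value.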