newscat

Overview

newscat provides fast and accurate extraction of news article content. It is trained to extract the text of news articles while excluding clutter such as image and video captions, editorial side notes, related content, teasers, advertisements, user comments and metadata.

Getting Started

To download newscat, run

go get github.com/slyrz/newscat

This downloads the source code of newscat and, if it is not already present, newscat's only non-standard build dependency: the html package from the Go networking libraries (go.net, now published as golang.org/x/net). Then run

go build github.com/slyrz/newscat

to build newscat. go build writes the newscat binary to the current directory; to place it in your $GOPATH/bin directory, run go install with the same package path instead.

Usage

newscat accepts file paths and HTTP URLs as command line arguments.

newscat [PATH|URL]...

If no arguments are passed, newscat expects HTML written to its standard input.

newscat < PATH
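The input rules above (paths, HTTP URLs, or standard input when no arguments are given) can be illustrated with a small Go sketch. The helpers isHTTPURL and sources are hypothetical names made up for this example; they are not taken from newscat's source, and newscat's actual argument handling may differ.

```go
package main

import (
	"fmt"
	"net/url"
)

// isHTTPURL reports whether arg parses as an absolute http or https URL.
// Hypothetical helper for illustration; not newscat's own code.
func isHTTPURL(arg string) bool {
	u, err := url.Parse(arg)
	if err != nil {
		return false
	}
	return (u.Scheme == "http" || u.Scheme == "https") && u.Host != ""
}

// sources labels each argument as a URL or a file path. With no
// arguments it falls back to standard input, mirroring the usage
// line "newscat [PATH|URL]...".
func sources(args []string) []string {
	if len(args) == 0 {
		return []string{"stdin"}
	}
	out := make([]string, 0, len(args))
	for _, a := range args {
		if isHTTPURL(a) {
			out = append(out, "url:"+a)
		} else {
			out = append(out, "path:"+a)
		}
	}
	return out
}

func main() {
	fmt.Println(sources([]string{"article.html", "http://example.com/news"}))
	fmt.Println(sources(nil))
}
```

Anything that does not parse as an http or https URL is treated as a file path in this sketch, which is one plausible reading of the usage line rather than a documented rule.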

newscat prints the extracted article text to standard output. If you want paragraphs wrapped to a uniform line width, pipe newscat's output to the fmt command.

newscat ... | fmt

Training and Evaluation

A total of 300 news articles were gathered by crawling top submissions from various news-related Reddit communities. A gold standard was created for every HTML page by manually adding a custom HTML5 data attribute to the elements containing relevant content.

The data set was split into training and test data. 50 randomly chosen news articles were used to train newscat. The remaining 250 news articles were used to evaluate the content extraction quality.
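A random 50/250 split like the one described above could be produced as follows. This is a minimal sketch, not the script used for newscat: the Split function, the fixed seed and the file names are assumptions made for illustration.

```go
package main

import (
	"fmt"
	"math/rand"
)

// Split shuffles a copy of items and returns the first nTrain entries as
// training data and the remaining entries as test data. The seed makes the
// split reproducible. The input slice is left unchanged.
func Split(items []string, nTrain int, seed int64) (train, test []string) {
	shuffled := append([]string(nil), items...)
	rng := rand.New(rand.NewSource(seed))
	rng.Shuffle(len(shuffled), func(i, j int) {
		shuffled[i], shuffled[j] = shuffled[j], shuffled[i]
	})
	if nTrain < 0 {
		nTrain = 0
	}
	if nTrain > len(shuffled) {
		nTrain = len(shuffled)
	}
	// The three-index slice caps train so appending to it cannot overwrite test.
	return shuffled[:nTrain:nTrain], shuffled[nTrain:]
}

func main() {
	// Hypothetical file names standing in for the 300 labelled pages.
	pages := make([]string, 300)
	for i := range pages {
		pages[i] = fmt.Sprintf("article-%03d.html", i)
	}
	train, test := Split(pages, 50, 1)
	fmt.Println(len(train), len(test)) // prints "50 250"
}
```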

For each news article, we compared the element-level predictions with the real labels and calculated precision, recall and the balanced F-score.
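The per-article metrics can be computed from the element-level labels as sketched below. The Scores type and the Evaluate function are illustrative names, not part of newscat; the sketch assumes each HTML element is reduced to a boolean label (true meaning "contains article content") and that both slices list the same elements in the same order.

```go
package main

import "fmt"

// Scores holds the element-level quality measures for one article.
type Scores struct {
	Precision, Recall, F1 float64
}

// Evaluate compares predicted element labels with the gold-standard labels.
// Both slices are assumed to have the same length and element order.
// A measure whose denominator is zero is reported as 0.
func Evaluate(predicted, gold []bool) Scores {
	var tp, fp, fn float64
	for i := range gold {
		switch {
		case predicted[i] && gold[i]:
			tp++ // content element correctly extracted
		case predicted[i] && !gold[i]:
			fp++ // clutter wrongly extracted
		case !predicted[i] && gold[i]:
			fn++ // content element missed
		}
	}
	var s Scores
	if tp+fp > 0 {
		s.Precision = tp / (tp + fp)
	}
	if tp+fn > 0 {
		s.Recall = tp / (tp + fn)
	}
	if s.Precision+s.Recall > 0 {
		// Balanced F-score: harmonic mean of precision and recall.
		s.F1 = 2 * s.Precision * s.Recall / (s.Precision + s.Recall)
	}
	return s
}

func main() {
	// Made-up labels for a five-element page, for illustration only.
	predicted := []bool{true, true, false, true, false}
	gold := []bool{true, false, false, true, true}
	s := Evaluate(predicted, gold)
	fmt.Printf("precision=%.2f recall=%.2f f1=%.2f\n", s.Precision, s.Recall, s.F1)
}
```

In the example, two content elements are found, one clutter element is extracted by mistake and one content element is missed, so precision, recall and F-score all equal 2/3.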

The figure above shows the resulting F-scores sorted in increasing order. The x-axis shows the percentage of articles whose F-scores fall below the value indicated by the y-axis; in other words, the percentiles.
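The points of such a percentile curve can be derived from the per-article F-scores as sketched here. Point and PercentileCurve are hypothetical names, and the scores in main are made up for illustration; they are not newscat's evaluation results.

```go
package main

import (
	"fmt"
	"sort"
)

// Point is one point of the curve: the x value is Percent, the y value
// is FScore.
type Point struct {
	Percent float64
	FScore  float64
}

// PercentileCurve sorts a copy of the scores in increasing order and pairs
// the i-th smallest score with its rank expressed as a percentage of all
// articles.
func PercentileCurve(fscores []float64) []Point {
	sorted := append([]float64(nil), fscores...)
	sort.Float64s(sorted)
	curve := make([]Point, len(sorted))
	for i, f := range sorted {
		curve[i] = Point{
			Percent: 100 * float64(i+1) / float64(len(sorted)),
			FScore:  f,
		}
	}
	return curve
}

func main() {
	// Made-up F-scores for illustration only.
	for _, p := range PercentileCurve([]float64{0.95, 0.60, 0.88, 0.99}) {
		fmt.Printf("%5.1f%%  %.2f\n", p.Percent, p.FScore)
	}
}
```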

License

newscat is released under the MIT License. You can find a copy of the license in the LICENSE file.
