Add simple file loader
MichaelKohler committed Oct 27, 2020
1 parent 2bb066d commit c0f3c81
Showing 4 changed files with 68 additions and 1 deletion.
10 changes: 10 additions & 0 deletions README.md
@@ -62,6 +62,16 @@ cargo run -- extract -l en -d ../wikiextractor/text/ >> wiki.en.txt

*Tip: You don't need to wait for this last process to finish before inspecting the output. wiki.en.txt should contain a few thousand sentences within a few minutes, which lets you estimate the quality of the output early on and stop the process if you are not happy with it.*

### Extract from line break separated files

If you have one or more files with one sentence per line, you can use this extractor to filter those sentences with the defined language rules. This is useful if you have a large list of sentences and want to keep only the ones that match the rules.

By default you can extract 10000 sentences per file.

```
cargo run -- extract-file -l en -d ../texts/ >> file.en.txt
```

## Using language rules

The following rules can be configured per language. Add a `<language>.toml` file in the `rules` directory to enable a new locale.
17 changes: 16 additions & 1 deletion src/app.rs
@@ -2,7 +2,7 @@ use clap::{App, Arg, ArgMatches, SubCommand};
use std::ffi::OsString;

use crate::extractor::extract;
-use crate::loaders::{Wikipedia};
+use crate::loaders::{File, Wikipedia};

const VERSION: &str = env!("CARGO_PKG_VERSION");

@@ -41,6 +41,12 @@ where
                .arg(&language_argument)
                .arg(&directory_argument)
        )
        .subcommand(
            SubCommand::with_name("extract-file")
                .about("Extract sentences from files which have one sentence per line")
                .arg(&language_argument)
                .arg(&directory_argument)
        )
        .get_matches_from(itr)
}

@@ -65,6 +71,15 @@ fn start(all_matches: ArgMatches) -> Result<(), String> {
        return extract(wikipedia_loader, no_check);
    }

    // File
    if let Some(matches) = all_matches.subcommand_matches("extract-file") {
        let language = String::from(matches.value_of("language").unwrap_or("en"));
        let directory = String::from(matches.value_of("dir").unwrap_or_default());

        let file_loader = File::new(language, directory);
        return extract(file_loader, no_check);
    }

    println!("{}", all_matches.usage());
    Err(String::from("Did you forget to add a subcommand?"))
}
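
The defaulting in the new `extract-file` branch can be tried in isolation. The sketch below is not part of the commit: `resolve_args` is a hypothetical helper, and its two `Option<&str>` parameters stand in for what clap's `value_of` returns. It shows that a missing language falls back to `"en"` and a missing directory falls back to the empty string. Whether clap can actually deliver `None` here depends on how `language_argument` and `directory_argument` are declared, which this diff does not show.

```rust
// Hypothetical helper, not part of the commit. The Option<&str> inputs
// stand in for `matches.value_of("language")` and `matches.value_of("dir")`.
fn resolve_args(language: Option<&str>, dir: Option<&str>) -> (String, String) {
    // Same fallback as the diff: default to English.
    let language = String::from(language.unwrap_or("en"));
    // `unwrap_or_default` yields "" for a missing &str.
    let directory = String::from(dir.unwrap_or_default());
    (language, directory)
}
```

If an empty directory is not a meaningful input for the extractor, returning an error at this point instead of defaulting may be worth considering.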
40 changes: 40 additions & 0 deletions src/loaders/file.rs
@@ -0,0 +1,40 @@
use std::fs::File;
use std::io::Read;
use std::path::PathBuf;

use super::definition::Loader;
use crate::config::Config;

pub struct FileLoader {
    pub config: Config,
}

impl FileLoader {
    pub fn new(language: String, directory: String) -> Self {
        let config = Config {
            language,
            directory,
            max_sentences_per_text: std::usize::MAX,
            file_prefix: String::from(""),
        };

        Self { config }
    }
}

impl Loader for FileLoader {
    fn get_config(&self) -> &Config {
        &self.config
    }

    fn load(&self, file_name: &PathBuf) -> Result<Vec<String>, String> {
        let mut file = File::open(file_name).map_err(|e| format!("{}", e))?;
        let mut all_sentences = String::new();
        file.read_to_string(&mut all_sentences)
            .map_err(|e| format!("{}", e))?;
        Ok(all_sentences
            .lines()
            .map(|sentence| String::from(sentence))
            .collect())
    }
}
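
To experiment with the loader outside the crate, a minimal self-contained sketch follows. The `Config` struct and `Loader` trait are assumptions reconstructed from how this diff uses them; the real definitions in `crate::config` and `super::definition` are not shown here and may differ. The sketch also swaps in `std::fs::read_to_string` for the `File::open` plus `read_to_string` pair and takes `&Path` instead of `&PathBuf`, purely for brevity.

```rust
use std::fs;
use std::path::Path;

// Assumed shape of the crate's Config, inferred from FileLoader::new in the diff.
pub struct Config {
    pub language: String,
    pub directory: String,
    pub max_sentences_per_text: usize,
    pub file_prefix: String,
}

// Assumed shape of the Loader trait, inferred from the impl in the diff.
pub trait Loader {
    fn get_config(&self) -> &Config;
    fn load(&self, file_name: &Path) -> Result<Vec<String>, String>;
}

pub struct FileLoader {
    pub config: Config,
}

impl FileLoader {
    pub fn new(language: String, directory: String) -> Self {
        let config = Config {
            language,
            directory,
            // No per-file cap is set by the loader itself.
            max_sentences_per_text: usize::MAX,
            file_prefix: String::new(),
        };
        Self { config }
    }
}

impl Loader for FileLoader {
    fn get_config(&self) -> &Config {
        &self.config
    }

    fn load(&self, file_name: &Path) -> Result<Vec<String>, String> {
        // Read the whole file into memory, then treat every line as one sentence.
        let contents = fs::read_to_string(file_name).map_err(|e| e.to_string())?;
        Ok(contents.lines().map(String::from).collect())
    }
}
```

Calling `FileLoader::new(String::from("en"), String::new()).load(path)` on a file with one sentence per line returns those sentences in order. `str::lines` strips both `\n` and `\r\n` endings, so files with Windows line endings are handled as well. Because the whole file is read into memory at once, very large inputs might be better served by a buffered, line-by-line reader.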
2 changes: 2 additions & 0 deletions src/loaders/mod.rs
@@ -1,5 +1,7 @@
pub use wikipedia::Wikipedia;
pub use file::FileLoader as File;
pub use definition::Loader;

pub mod wikipedia;
pub mod file;
mod definition;
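
The `pub use file::FileLoader as File;` line re-exports the loader under a shorter name, which is why `app.rs` can write `File::new(...)`. The sketch below reproduces that pattern with a stripped-down, hypothetical `FileLoader` (a single `directory` field, unlike the real struct) and a hypothetical `describe` function. It uses `self::file` in the re-export so that it compiles on any Rust edition.

```rust
mod loaders {
    pub mod file {
        // Stripped-down stand-in for the real FileLoader.
        pub struct FileLoader {
            pub directory: String,
        }

        impl FileLoader {
            pub fn new(directory: String) -> Self {
                Self { directory }
            }
        }
    }

    // Re-export under an alias, mirroring src/loaders/mod.rs.
    pub use self::file::FileLoader as File;
}

// `loaders::File` and `loaders::file::FileLoader` name the same type.
fn describe(loader: &loaders::File) -> String {
    format!("loading from '{}'", loader.directory)
}
```

One thing to keep in mind with this alias: any module that imports both `crate::loaders::File` and `std::fs::File` will hit a name clash, so one of the two needs to be renamed or path-qualified there.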
