eurlex.js is a command line utility to retrieve documents (specifically: regulation drafts) in all supported languages from the EUR-Lex website and convert them into JSON. It is made with node.js and can be installed locally via npm.
eurlex.js can be installed using npm:
npm install -g eurlex
Of course, you must have node.js with npm installed.
eurlex.js works fine with Linux, *BSD and Darwin, but has never been tested on Win32.
Once installed you can use eurlex on the command line:
eurlex [options] <EUR-Lex URI>
You get a brief description of all the options with
eurlex --help
If you are curious what it looks like to get and convert something, try:
eurlex -vu -l de,en,fr COM:2012:0011:FIN -o eurlex-com-2012-0011-fin.json
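If you would rather trigger the same call from a node.js script, the following is a minimal sketch of one way to do it. It only reuses the flags from the example above (-vu, -l, -o) and assumes that eurlex is on your PATH. The helper names buildEurlexArgs and runEurlex are made up for illustration and are not part of eurlex.js.

```javascript
// Sketch: run the eurlex CLI from node.js.
// buildEurlexArgs and runEurlex are hypothetical helpers, not part of eurlex.js.
const { execFile } = require('child_process');

// Build the argument list in the same order as the command line example.
function buildEurlexArgs(uri, langs, outFile) {
  return ['-vu', '-l', langs.join(','), uri, '-o', outFile];
}

// Spawn eurlex; callback receives (error, stdout, stderr) from execFile.
function runEurlex(uri, langs, outFile, callback) {
  execFile('eurlex', buildEurlexArgs(uri, langs, outFile), callback);
}

// Usage (not executed here):
// runEurlex('COM:2012:0011:FIN', ['de', 'en', 'fr'],
//   'eurlex-com-2012-0011-fin.json', (err) => { if (err) throw err; });
```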
Since the HTML output of EUR-Lex is pretty far from being machine readable, eurlex.js applies a lot of magic to read it anyway. The magic can be fine-tuned with settings in a file called profile.json. Here is a stripped and commented version of profile.json:
{
"lang": ["en","de","..."], // array of available languages
"expressions": { // regular expressions
"lang": "...", // to match the language of the document
"title": "..." // to match the title of the document
},
"delimiters": { // delimiters (they are all regex)
"en": { // for this language
"recitals": ["...","..."], // start and end of recitals
"articles": ["...","..."], // start and end of articles
"chapter": "^CHAPTER ", // string to match a chapter
"section": "^SECTION ", // string to match a section
"article": "^Article ", // string to match an article
"fixes": [ // before a line is parsed
["...","..."], // .replace(/first/, "second")
["...","..."] // as many as you need
]
},
"lv": {
"recitals": ["...","..."],
"articles": ["...","..."],
"chapter": [ // if this is an array
"^([XVI]+) NODAĻA", // if matches: chapter
"^([XVI]+) NODAĻA$", // if matches: text missing
"^([XVI]+) NODAĻA (.*)$" // $1 is the literal, $2 is the text
],
"section": [ // same here...
"^([0-9]+)\\. IEDAĻA",
"^([0-9]+)\\. IEDAĻA$",
"^([0-9]+)\\. IEDAĻA (.*)$"
],
"article": [ // note! for article[3]
"^([0-9]+)\\. pants", // $1 is the literal, __$3__ is the text
"^([0-9]+)\\. pants$",
"^([0-9]+)(\\.) pants (.*)$"
],
"fixes": [] // fixes indeed can be empty
}
}
}
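To make the two delimiter forms more concrete, here is a rough sketch of how the comments above could be read. It is an interpretation of the profile format, not the actual eurlex.js parser. The function names applyFixes and matchHeading, the textGroup parameter and the returned object shape are all assumptions made for this example.

```javascript
// Sketch only: one possible reading of the profile.json comments.
// applyFixes and matchHeading are hypothetical names, not eurlex.js internals.

// Apply each [pattern, replacement] pair to a line before it is parsed,
// mirroring .replace(/first/, "second").
function applyFixes(line, fixes) {
  return fixes.reduce(
    (text, [pattern, replacement]) => text.replace(new RegExp(pattern), replacement),
    line
  );
}

// delimiter is either a single regex string, or an array of three regex strings:
//   [0] matches a heading, [1] matches a heading without text,
//   [2] captures the literal ($1) and the text.
// textGroup is the capture group holding the text: 2 by default,
// 3 for "article" in the "lv" example.
function matchHeading(line, delimiter, textGroup = 2) {
  if (typeof delimiter === 'string') {
    return new RegExp(delimiter).test(line) ? { literal: null, text: null } : null;
  }
  const [isHeading, isBare, capture] = delimiter.map((d) => new RegExp(d));
  if (!isHeading.test(line)) return null;       // not a heading at all
  const bare = line.match(isBare);
  if (bare) return { literal: bare[1], text: null }; // heading, text missing
  const full = line.match(capture);
  return full
    ? { literal: full[1], text: full[textGroup] }
    : { literal: null, text: null };
}

// Delimiters copied from the "lv" profile above.
const lvChapter = ['^([XVI]+) NODAĻA', '^([XVI]+) NODAĻA$', '^([XVI]+) NODAĻA (.*)$'];
const lvArticle = ['^([0-9]+)\\. pants', '^([0-9]+)\\. pants$', '^([0-9]+)(\\.) pants (.*)$'];
```

With these definitions, matchHeading(line, lvChapter) reads the text from $2, while matchHeading(line, lvArticle, 3) reads it from $3, as the note on the article array says.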
- In Magyar, paragraphs and points partly use the same literal enclosures, which leads to paragraphs being interpreted as headless points. You should be safe using --unify with another language as the first parameter.
- The translations for Malti are poorly formatted and contain redundant fragments. You have to rely heavily on the fixes in your profile.json.
eurlex.js is licensed under the EUPL.