
Generate notebook with HTML for admonitions #152

Merged: 2 commits, Dec 18, 2020
2 changes: 1 addition & 1 deletion Makefile
@@ -10,7 +10,7 @@ all: $(NOTEBOOKS_DIR)
 $(NOTEBOOKS_DIR): $(MINIMAL_NOTEBOOK_FILES) sanity_check_$(NOTEBOOKS_DIR)
 
 $(NOTEBOOKS_DIR)/%.ipynb: $(PYTHON_SCRIPTS_DIR)/%.py
-	jupytext --to notebook $< --output $@
+	python build_tools/convert-python-script-to-notebook.py $< $@
 
 sanity_check_$(NOTEBOOKS_DIR):
 	python build_tools/sanity-check.py $(PYTHON_SCRIPTS_DIR) $(NOTEBOOKS_DIR)
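With this change, a single notebook can also be rebuilt by hand by calling the converter directly. A sketch of the equivalent manual invocation, assuming the repository layout implied by the Makefile variables (`python_scripts/` as `PYTHON_SCRIPTS_DIR`, `notebooks/` as `NOTEBOOKS_DIR`):

```shell
# Rebuild one notebook without going through make.
# Paths are assumptions based on the Makefile variables above.
python build_tools/convert-python-script-to-notebook.py \
    python_scripts/01_tabular_data_exploration.py \
    notebooks/01_tabular_data_exploration.ipynb
```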
138 changes: 138 additions & 0 deletions build_tools/convert-python-script-to-notebook.py
@@ -0,0 +1,138 @@
import sys
Review comment on `import sys` (Collaborator): Maybe a small docstring to explain what is going on in this file / what is its purpose

Reply (@lesteve, Dec 18, 2020): Yeah that's a good point, I added small docstrings (be it only to remember it myself) to the functions but not to the module.

Follow-up (@lesteve): Done in d61c5c1


from docutils.core import publish_from_doctree

from bs4 import BeautifulSoup

from myst_parser.main import MdParserConfig, default_parser

import jupytext

# https://www.sphinx-doc.org/en/master/usage/restructuredtext/basics.html#directives
# Docutils supports the following directives:
# Admonitions: attention, caution, danger, error, hint, important, note, tip,
# warning and the generic admonition
all_admonitions = [
    "attention",
    "caution",
    "danger",
    "error",
    "hint",
    "important",
    "note",
    "tip",
    "warning",
    "admonition",
]

# follows the colors defined by the JupyterBook CSS
sphinx_name_to_bootstrap = {
    "attention": "warning",
    "caution": "warning",
    "danger": "danger",
    "error": "danger",
    "hint": "warning",
    "important": "info",
    "note": "info",
    "tip": "warning",
    "warning": "danger",
    "admonition": "info",
}

all_directive_names = ["{" + adm + "}" for adm in all_admonitions]


def convert_to_html(doc, css_selector=None):
    """Convert a docutils document to HTML, optionally selecting part of
    it with a CSS selector.
    """
    html_str = publish_from_doctree(doc, writer_name="html").decode("utf-8")
    html_node = BeautifulSoup(html_str, features="html.parser")

    if css_selector is not None:
        html_node = html_node.select_one(css_selector)

    return html_node


def admonition_html(doc):
    """Return the admonition HTML from a docutils document.

    Assumes that the docutils document has a single child which is an
    admonition.
    """
    assert len(doc.children) == 1
    adm_node = doc.children[0]
    assert adm_node.tagname in all_admonitions
    html_node = convert_to_html(doc, "div.admonition")
    bootstrap_class = sphinx_name_to_bootstrap[adm_node.tagname]
    html_node.attrs["class"] += [f"alert alert-{bootstrap_class}"]
    html_node.select_one(
        ".admonition-title").attrs["style"] = "font-weight: bold;"

    return str(html_node)


def replace_admonition_in_cell_source(cell_str):
    """Return the cell source with each admonition replaced by its
    generated HTML.
    """
    config = MdParserConfig(renderer="docutils")
    parser = default_parser(config)
    tokens = parser.parse(cell_str)

    admonition_tokens = [
        t for t in tokens
        if t.type == "fence" and t.info in all_directive_names
    ]

    cell_lines = cell_str.splitlines()
    new_cell_str = cell_str

    for t in admonition_tokens:
        adm_begin, adm_end = t.map
        adm_src = "\n".join(cell_lines[adm_begin:adm_end])
        adm_doc = parser.render(adm_src)
        adm_html = admonition_html(adm_doc)
        new_cell_str = new_cell_str.replace(adm_src, adm_html)

    return new_cell_str


def replace_admonition_in_nb(nb):
    """Replace all admonitions in a notebook object by their generated
    HTML.
    """
    # FIXME: this would not work with the advanced ::: syntax for
    # admonitions, but we are not using it for now. Alternatively, we
    # could parse all the markdown cells; a bit wasteful, but probably
    # good enough.
    cells_with_admonition = [
        (i, c)
        for i, c in enumerate(nb["cells"])
        if c["cell_type"] == "markdown"
        and any(directive in c["source"] for directive in all_directive_names)
    ]

    for i, c in cells_with_admonition:
        cell_src = c["source"]
        output_src = replace_admonition_in_cell_source(cell_src)
        nb.cells[i]["source"] = output_src


def replace_admonition_in_filename(input_filename, output_filename):
    """Convert a .py file to an .ipynb file in which admonitions have
    been replaced by their generated HTML.

    Context: MyST syntax is not supported inside a Jupyter notebook.
    This is a hacky way to keep using MyST admonitions for our
    JupyterBook and still have acceptable admonition HTML in the Jupyter
    notebook interface.
    """
    nb = jupytext.read(input_filename)

    replace_admonition_in_nb(nb)

    jupytext.write(nb, output_filename)


if __name__ == "__main__":
    input_filename = sys.argv[1]
    output_filename = sys.argv[2]
    replace_admonition_in_filename(input_filename, output_filename)
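For reference, the shape of the HTML this script emits can be sketched without docutils at all. `render_admonition` below is a hypothetical stand-in, not part of the repository; the real script obtains the body HTML from docutils/BeautifulSoup and only patches classes and styles, and its titles can differ (e.g. "Caution!" rather than "Caution"):

```python
# Minimal, dependency-free sketch of the target admonition HTML.
# `render_admonition` is illustrative only; the title casing is
# approximated with str.capitalize().
import html

SPHINX_NAME_TO_BOOTSTRAP = {
    "attention": "warning", "caution": "warning", "danger": "danger",
    "error": "danger", "hint": "warning", "important": "info",
    "note": "info", "tip": "warning", "warning": "danger",
    "admonition": "info",
}


def render_admonition(name, body):
    bootstrap = SPHINX_NAME_TO_BOOTSTRAP[name]
    title = name.capitalize()
    return (
        f'<div class="admonition {name} alert alert-{bootstrap}">\n'
        f'<p class="first admonition-title" style="font-weight: bold;">'
        f'{title}</p>\n'
        f'<p class="last">{html.escape(body)}</p>\n'
        f'</div>'
    )


print(render_admonition("note", "Classes are slightly imbalanced."))
```

This mirrors the `div.admonition` blocks that appear in the notebook diffs: a bold title paragraph followed by the body, with Bootstrap `alert` classes appended.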
31 changes: 17 additions & 14 deletions notebooks/01_tabular_data_exploration.ipynb
@@ -96,13 +96,14 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"```{note}\n",
-"Classes are slightly imbalanced. Class imbalance happens often in\n",
+"<div class=\"admonition note alert alert-info\">\n",
+"<p class=\"first admonition-title\" style=\"font-weight: bold;\">Note</p>\n",
+"<p class=\"last\">Classes are slightly imbalanced. Class imbalance happens often in\n",
 "practice and may need special techniques for machine learning. For example in\n",
 "a medical setting, if we are trying to predict whether patients will develop\n",
 "a rare disease, there will be a lot more healthy patients than ill patients\n",
-"in the dataset.\n",
-"```"
+"in the dataset.</p>\n",
+"</div>"
 ]
},
{
@@ -202,17 +203,19 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"```{tip}\n",
-"In the code cell, we are using `sns.set_context` to globally change\n",
+"<div class=\"admonition tip alert alert-warning\">\n",
+"<p class=\"first admonition-title\" style=\"font-weight: bold;\">Tip</p>\n",
+"<p class=\"last\">In the code cell, we are using <tt class=\"docutils literal\">sns.set_context</tt> to globally change\n",
 "the rendering of the figure with larger fonts and line. We will use this\n",
-"call in all notebooks.\n",
-"```\n",
-"```{tip}\n",
-"In the cell, we are calling the following pattern: `_ = func()`. It assigns\n",
-"the output of `func()` into the variable called `_`. By convention, in Python\n",
-"`_` serves as a \"garbage\" variable to store results that we are not\n",
-"interested in.\n",
-"```\n",
+"call in all notebooks.</p>\n",
+"</div>\n",
+"<div class=\"admonition tip alert alert-warning\">\n",
+"<p class=\"first admonition-title\" style=\"font-weight: bold;\">Tip</p>\n",
+"<p class=\"last\">In the cell, we are calling the following pattern: <tt class=\"docutils literal\">_ = func()</tt>. It assigns\n",
+"the output of <tt class=\"docutils literal\">func()</tt> into the variable called <tt class=\"docutils literal\">_</tt>. By convention, in Python\n",
+"<tt class=\"docutils literal\">_</tt> serves as a \"garbage\" variable to store results that we are not\n",
+"interested in.</p>\n",
+"</div>\n",
 "\n",
 "We can already make a few comments about some of the variables:\n",
 "\n",
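The `_ = func()` convention discussed in the second tip works like this (a tiny standalone sketch; `plot_something` is a made-up stand-in for a plotting call that returns an object we do not care about):

```python
# The `_` convention: capture a return value we are not interested in,
# so that it is not echoed as the cell output in a notebook / REPL.
def plot_something():
    # Stand-in for e.g. a seaborn call returning an Axes object.
    return "Axes object"


_ = plot_something()  # return value intentionally discarded
```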
29 changes: 16 additions & 13 deletions notebooks/02_numerical_pipeline_hands_on.ipynb
@@ -91,11 +91,12 @@
 "\n",
 "The first task here will be to identify numerical data in our dataset.\n",
 "\n",
-"```{caution}\n",
-"Numerical data are represented with numbers, but numbers are not always\n",
+"<div class=\"admonition caution alert alert-warning\">\n",
+"<p class=\"first admonition-title\" style=\"font-weight: bold;\">Caution!</p>\n",
+"<p class=\"last\">Numerical data are represented with numbers, but numbers are not always\n",
 "representing numerical data. Categories could already be encoded with\n",
-"numbers and you will need to identify these features.\n",
-"```\n",
+"numbers and you will need to identify these features.</p>\n",
+"</div>\n",
 "\n",
 "Thus, we can check the data type for each of the column in the dataset."
 ]
@@ -270,21 +271,23 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"```{tip}\n",
-"`random_state` parameter allows to get a deterministic results even if we\n",
-"use some random process (i.e. data shuffling).\n",
-"```\n",
+"<div class=\"admonition tip alert alert-warning\">\n",
+"<p class=\"first admonition-title\" style=\"font-weight: bold;\">Tip</p>\n",
+"<p class=\"last\"><tt class=\"docutils literal\">random_state</tt> parameter allows to get a deterministic results even if we\n",
+"use some random process (i.e. data shuffling).</p>\n",
+"</div>\n",
 "\n",
 "In the previous notebook, we used a k-nearest neighbors predictor. While this\n",
 "model is really intuitive to understand, it is not widely used. Here, we will\n",
 "a predictive model belonging to the linear model families.\n",
 "\n",
-"```{note}\n",
-"In short, these models find a set of weights to combine each column in the\n",
+"<div class=\"admonition note alert alert-info\">\n",
+"<p class=\"first admonition-title\" style=\"font-weight: bold;\">Note</p>\n",
+"<p class=\"last\">In short, these models find a set of weights to combine each column in the\n",
 "data matrix to predict the target. For instance, the model can come up with\n",
-"rules such as `0.1 * age + 3.3 * education-num - 15.1 > 0` means that\n",
-"`high-income` is predicted.\n",
-"```\n",
+"rules such as <tt class=\"docutils literal\">0.1 * age + 3.3 * <span class=\"pre\">education-num</span> - 15.1 &gt; 0</tt> means that\n",
+"<tt class=\"docutils literal\"><span class=\"pre\">high-income</span></tt> is predicted.</p>\n",
+"</div>\n",
 "\n",
 "Thus, we will use a logistic regression classifier and train it."
 ]
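The linear rule quoted in that note can be written out directly. A sketch using the note's own illustrative weights (these numbers come from the admonition text, not from a trained model; a real logistic regression would learn them from data):

```python
# Decision rule of the form described in the note:
# 0.1 * age + 3.3 * education-num - 15.1 > 0  ->  predict "high-income"
def predict_high_income(age, education_num):
    return 0.1 * age + 3.3 * education_num - 15.1 > 0


print(predict_high_income(age=40, education_num=13))  # True: 46.9 > 15.1
```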
9 changes: 5 additions & 4 deletions notebooks/02_numerical_pipeline_introduction.ipynb
@@ -124,11 +124,12 @@
 "strategy. The `fit` method is called to train the model from the input\n",
 "(features) and target data.\n",
 "\n",
-"```{caution}\n",
-"We use a K-nearest neighbors here. However, be aware that it is seldom useful\n",
+"<div class=\"admonition caution alert alert-warning\">\n",
+"<p class=\"first admonition-title\" style=\"font-weight: bold;\">Caution!</p>\n",
+"<p class=\"last\">We use a K-nearest neighbors here. However, be aware that it is seldom useful\n",
 "in practice. We use it because it is an intuitive algorithm. In the next\n",
-"notebook, we will introduce better models.\n",
-"```"
+"notebook, we will introduce better models.</p>\n",
+"</div>"
 ]
},
{
28 changes: 15 additions & 13 deletions notebooks/02_numerical_pipeline_scaling.ipynb
@@ -210,19 +210,21 @@
 "regarding feature distributions and having normalized feature is usually one\n",
 "of them.\n",
 "\n",
-"```{tip}\n",
-"Some of the reasons for scaling features are:\n",
-"\n",
-"* predictor using Euclidean distance (e.g k-nearest neighbors) should have\n",
-"  normalized feature such that each feature contribute equally distance\n",
-"  computation;\n",
-"* predictor internally using gradient-descent based algorithms\n",
-"  (e.g. logistic regression) to find optimal parameters work better\n",
-"  and faster simplify choice of parameters as learning-rate;\n",
-"* predictor using regularization (e.g. logistic regression) require\n",
-"  normalized features such that the penalty is properly applied to the\n",
-"  weights.\n",
-"```\n",
+"<div class=\"admonition tip alert alert-warning\">\n",
+"<p class=\"first admonition-title\" style=\"font-weight: bold;\">Tip</p>\n",
+"<p>Some of the reasons for scaling features are:</p>\n",
+"<ul class=\"last simple\">\n",
+"<li>predictor using Euclidean distance (e.g k-nearest neighbors) should have\n",
+"normalized feature such that each feature contribute equally distance\n",
+"computation;</li>\n",
+"<li>predictor internally using gradient-descent based algorithms\n",
+"(e.g. logistic regression) to find optimal parameters work better\n",
+"and faster simplify choice of parameters as learning-rate;</li>\n",
+"<li>predictor using regularization (e.g. logistic regression) require\n",
+"normalized features such that the penalty is properly applied to the\n",
+"weights.</li>\n",
+"</ul>\n",
+"</div>\n",
 "\n",
 "We show how to apply such normalization using a scikit-learn transformer\n",
 "called `StandardScaler`. This transformer intend to transform feature such\n",
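The normalization discussed in this cell, which `StandardScaler` applies per feature, is just `(x - mean) / std`. A dependency-free sketch of that transformation:

```python
# What StandardScaler does to each feature column: subtract the mean
# and divide by the (population) standard deviation.
from statistics import mean, pstdev


def standard_scale(values):
    mu, sigma = mean(values), pstdev(values)
    return [(x - mu) / sigma for x in values]


scaled = standard_scale([10.0, 20.0, 30.0])
print(scaled)  # centered on 0, with unit variance
```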
27 changes: 15 additions & 12 deletions notebooks/03_categorical_pipeline.ipynb
@@ -256,10 +256,11 @@
 "independently. We can also note that the number of features before and after\n",
 "the encoding is the same.\n",
 "\n",
-"```{tip}\n",
-"This encoding was used by the persons who published the dataset to transform\n",
-"the `\"education\"` feature into the `\"education-num\"` feature for instance.\n",
-"```\n",
+"<div class=\"admonition tip alert alert-warning\">\n",
+"<p class=\"first admonition-title\" style=\"font-weight: bold;\">Tip</p>\n",
+"<p class=\"last\">This encoding was used by the persons who published the dataset to transform\n",
+"the <tt class=\"docutils literal\">\"education\"</tt> feature into the <tt class=\"docutils literal\"><span class=\"pre\">\"education-num\"</span></tt> feature for instance.</p>\n",
+"</div>\n",
 "\n",
 "However, one has to be careful when using this encoding strategy. Using this\n",
 "integer representation can lead the downstream models to make the assumption\n",
@@ -281,11 +282,12 @@
 "then this encoding might be misleading to downstream statistical models and\n",
 "you might consider using one-hot encoding instead (see below).\n",
 "\n",
-"```{important}\n",
-"Note however that the impact of violating this ordering assumption is really\n",
+"<div class=\"admonition important alert alert-info\">\n",
+"<p class=\"first admonition-title\" style=\"font-weight: bold;\">Important</p>\n",
+"<p class=\"last\">Note however that the impact of violating this ordering assumption is really\n",
 "dependent on the downstream models (for instance linear models are much more\n",
-"sensitive than models built from a ensemble of decision trees).\n",
-"```\n",
+"sensitive than models built from a ensemble of decision trees).</p>\n",
+"</div>\n",
 "\n",
 "## Encoding nominal categories (without assuming any order)\n",
 "\n",
@@ -299,11 +301,12 @@
 "We will start by encoding a single feature (e.g. `\"education\"`) to illustrate\n",
 "how the encoding works.\n",
 "\n",
-"```{note}\n",
-"We will pass the argument `sparse=False` to the `OneHotEncoder` which will\n",
+"<div class=\"admonition note alert alert-info\">\n",
+"<p class=\"first admonition-title\" style=\"font-weight: bold;\">Note</p>\n",
+"<p class=\"last\">We will pass the argument <tt class=\"docutils literal\">sparse=False</tt> to the <tt class=\"docutils literal\">OneHotEncoder</tt> which will\n",
 "avoid obtaining a sparse matrix, which is less efficient but easier to\n",
-"inspect results for didactic purposes.\n",
-"```"
+"inspect results for didactic purposes.</p>\n",
+"</div>"
 ]
},
{
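The one-hot encoding this notebook introduces can be sketched in plain Python: one binary column per observed category. (The real `OneHotEncoder` additionally handles unseen categories, sparse output, multiple columns, etc.)

```python
# Minimal one-hot encoding sketch for a single categorical column.
def one_hot(column):
    categories = sorted(set(column))  # the "vocabulary" of the column
    return [
        [1 if value == cat else 0 for cat in categories]
        for value in column
    ]


print(one_hot(["Masters", "HS-grad", "Masters"]))
```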
9 changes: 5 additions & 4 deletions notebooks/cross_validation_train_test.ipynb
@@ -297,11 +297,12 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"```{tip}\n",
-"By convention, scikit-learn model evaluation tools always use a convention\n",
+"<div class=\"admonition tip alert alert-warning\">\n",
+"<p class=\"first admonition-title\" style=\"font-weight: bold;\">Tip</p>\n",
+"<p class=\"last\">By convention, scikit-learn model evaluation tools always use a convention\n",
 "where \"higher is better\", this explains we used\n",
-"`scoring=\"neg_mean_absolute_error\"` (meaning \"negative mean absolute error\").\n",
-"```\n",
+"<tt class=\"docutils literal\"><span class=\"pre\">scoring=\"neg_mean_absolute_error\"</span></tt> (meaning \"negative mean absolute error\").</p>\n",
+"</div>\n",
 "\n",
 "Let us revert the negation to get the actual error:"
 ]
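The sign convention described in that tip amounts to one line of arithmetic: `neg_mean_absolute_error` is the mean absolute error with its sign flipped, so a perfect model scores 0 and worse models score more negative values. A sketch:

```python
# MAE with the sign flipped, matching scikit-learn's
# "higher is better" scoring convention.
def neg_mean_absolute_error(y_true, y_pred):
    mae = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)
    return -mae


print(neg_mean_absolute_error([3.0, 5.0], [2.0, 7.0]))  # -(1 + 2) / 2 = -1.5
```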
9 changes: 5 additions & 4 deletions notebooks/ensemble_gradient_boosting.ipynb
@@ -119,10 +119,11 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"```{tip}\n",
-"In the cell above, we manually edit the legend to get only a single label\n",
-"for all residual lines.\n",
-"```\n",
+"<div class=\"admonition tip alert alert-warning\">\n",
+"<p class=\"first admonition-title\" style=\"font-weight: bold;\">Tip</p>\n",
+"<p class=\"last\">In the cell above, we manually edit the legend to get only a single label\n",
+"for all residual lines.</p>\n",
+"</div>\n",
 "Since the tree underfits the data, its accuracy is far from perfect on the\n",
 "training data. We can observe this in the figure by looking at the difference\n",
 "between the predictions and the ground-truth data. We represent these errors,\n",
9 changes: 5 additions & 4 deletions notebooks/ensemble_hist_gradient_boosting.ipynb
@@ -109,11 +109,12 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"```{note}\n",
-"The code cell above will generate a couple of warnings. Indeed, for some of\n",
+"<div class=\"admonition note alert alert-info\">\n",
+"<p class=\"first admonition-title\" style=\"font-weight: bold;\">Note</p>\n",
+"<p class=\"last\">The code cell above will generate a couple of warnings. Indeed, for some of\n",
 "the features, we requested too much bins in regard of the data dispersion for\n",
-"those features. The too small bins will be removed.\n",
-"```\n",
+"those features. The too small bins will be removed.</p>\n",
+"</div>\n",
 "We see that the discretizer transform the original data into an integer.\n",
 "This integer represents the bin index when the distribution by quantile is\n",
 "performed. We can check the number of bins per feature."
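The quantile binning the note describes (each value replaced by the index of the quantile bin it falls into) can be roughly sketched as follows. This is a simplified illustration with naive edge handling, not the real discretizer; note how duplicate values make edges collapse, which is exactly why requesting too many bins triggers the warnings mentioned above:

```python
# Rough sketch of quantile-based binning. Edges are taken at the
# requested quantiles of the sorted data; duplicate edges collapse,
# yielding fewer effective bins than requested.
def quantile_bin(values, n_bins):
    qs = sorted(values)
    step = len(qs) / n_bins
    edges = sorted({qs[int(i * step)] for i in range(1, n_bins)})

    def index(v):
        return sum(v >= e for e in edges)

    return [index(v) for v in values]


print(quantile_bin([1, 2, 3, 4, 5, 6, 7, 8], n_bins=4))
```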