Draft script to find diverging links #1966

Merged: 9 commits, Oct 7, 2024
Changes from 2 commits
docs/community/topics/dependencies-js.md: 1 addition & 1 deletion
@@ -8,7 +8,7 @@ There are two kinds of dependency definitions in this theme:
To update or add a JS dependency, follow these steps:

1. **Edit `package.json`** by adding or modifying a dependency.
-2. **Re-generate `package-lock.json`** in order to create a new set of frozen dependencies for the theme. To do this, run the following command from [the Sphinx Theme Builder](https://github.com/pradyunsg/sphinx-theme-builder).
+2. **Re-generate `package-lock.json`** in order to create a new set of frozen dependencies for the theme. To do this, run the following command from [the Sphinx Theme Builder](https://sphinx-theme-builder.readthedocs.io/en/latest/).

```
stb npm install --include=dev
```
docs/community/topics/manual-dev.md: 2 additions & 2 deletions
@@ -18,7 +18,7 @@ To do so, use a tool like [conda](https://docs.conda.io/en/latest/), [mamba](htt
Before you start, ensure that you have the following installed:

- Python >= 3.9
-- [Pandoc](https://pandoc.org/installing.html): we use `nbsphinx` to support notebook (.ipynb) files in the documentation, which requires [installing Pandoc](https://pandoc.org/installing.html) at a system level (or within a Conda environment).
+- [Pandoc](https://pandoc.org/): we use `nbsphinx` to support notebook (`.ipynb`) files in the documentation, which requires [installing Pandoc](https://pandoc.org/installing.html) at a system level (or within a Conda environment).

## Clone the repository locally

@@ -66,7 +66,7 @@ To manually open a server to watch your documentation for changes, build them, a
```
$ stb serve docs --open-browser
```

-## Run the tests
+## Manually Run the tests

To manually run the tests for this theme, first set up your environment locally, and then run:

docs/user_guide/accessibility.md: 1 addition & 1 deletion
@@ -69,7 +69,7 @@ Site maps, usually served from a file called `sitemap.xml` are a broadly-employed
approach to telling programs like search engines and assistive technologies where
different content appears on a website.

-If using a service like [ReadTheDocs](https://readthedocs.com), these files
+If using a service like [ReadTheDocs](https://about.readthedocs.com/), these files
will be created for you _automatically_, but for some other approaches below,
it's handy to generate a `sitemap.xml` locally or in CI with a tool like
[sphinx-sitemap](https://pypi.org/project/sphinx-sitemap/).
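
If you go that route, a minimal `conf.py` setup for sphinx-sitemap might look like the sketch below; the base URL is a placeholder to replace with your site's own:

```python
# conf.py: minimal sphinx-sitemap setup (sketch; the base URL is a placeholder)
extensions = [
    # ...your other extensions...
    "sphinx_sitemap",
]

# sphinx-sitemap resolves each page's entry in sitemap.xml against this base URL
html_baseurl = "https://example.org/docs/"
```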
docs/user_guide/indices.rst: 1 addition & 1 deletion
@@ -19,4 +19,4 @@ By design the indices pages are not linked in a documentation generated with this

.. note::

-   Don't forget to add back the ``"sidebar-ethical-ads.html"`` template if you are serving your documentation using `ReadTheDocs <https://readthedocs.org>`__.
+   Don't forget to add back the ``"sidebar-ethical-ads.html"`` template if you are serving your documentation using `ReadTheDocs <https://about.readthedocs.com/>`__.
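
For reference, re-adding that template to an overridden `html_sidebars` might look like the sketch below; the other template name shown is an assumption for illustration and should match your own configuration:

```python
# conf.py: sketch only; "sidebar-nav-bs.html" is an assumed entry for
# illustration, while "sidebar-ethical-ads.html" is the template to keep
html_sidebars = {
    "**": [
        "sidebar-nav-bs.html",       # example navigation sidebar template
        "sidebar-ethical-ads.html",  # keeps ReadTheDocs ethical ads visible
    ]
}
```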
tools/divergent_links.py: new file, 106 additions
@@ -0,0 +1,106 @@
"""This script help checking divergent links.

That is to say, links to the same page,
that have different titles.
gabalafou (Collaborator):

My terminology was terrible (divergent versus convergent) but you got my definition backwards. A divergent link is same name, different URLs.

So this got me a bit confused when I reviewed the PR.

I think the code in this PR actually detects divergent links (same name, different URLs), in which case the comment is wrong.

gabalafou (Collaborator, Sep 27, 2024):

PS. Please let's ditch my terminology. It's terrible. Nobody is going to be able to remember which side of divergent or convergent they are on. We should probably stick with names like same-name-different-URL links and same-URL-different-name links, unless we can think of something creative and memorable.

gabalafou (Collaborator):

Or maybe, less of a mouthful... name-consistent links (same name, different URLs) versus name-inconsistent links (different names, same URL)

gabalafou (Collaborator):

But I don't like that either because it makes it sound like one is bad (inconsistent) and the other good (consistent) whereas neither of them are good. So maybe, URL-inconsistent links versus name-inconsistent links.

Carreau (Collaborator, Author):

I've reworded.
"""
Carreau marked this conversation as resolved.
Show resolved Hide resolved

import os
import sys
from collections import defaultdict
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Link texts that commonly repeat across pages and are safe to skip.
ignores = [
"#",
"next",
"previous",
"[source]",
"edit on github",
"[docs]",
"read more ...",
"show source",
"module",
]


def find_html_files(folder_path):
"""Find all html files in given folder."""
html_files = []
for root, dirs, files in os.walk(folder_path):
for file in files:
if file.endswith(".html"):
html_files.append(os.path.join(root, file))
return html_files


class Checker:
"""Link checker."""

links: dict[str, list]

def __init__(self):
self.links = defaultdict(list)

def scan(self, html_content, identifier):
"""Scan given file for html links."""
# Parse the HTML content using BeautifulSoup
soup = BeautifulSoup(html_content, "html.parser")

# Extract all anchor tags
for a_tag in soup.find_all("a", href=True):
url = a_tag["href"]
if url.startswith("#"):
continue
content = a_tag.text.strip().lower()
if content in ignores:
continue
if content.split("\n")[0] in ignores:
continue

fullurl = urljoin(identifier, url)
self.links[content].append((fullurl, identifier))

def duplicates(self):
"""Print potential duplicates."""
for content, url_pages in self.links.items():
uniq_url = {u for u, _ in url_pages}
if len(uniq_url) >= 2:
                print(
                    f"{content!r} appears {len(url_pages)} times"
                    f" and points to {len(uniq_url)} different URLs:"
                )
dct = defaultdict(list)
for u, p in url_pages:
dct[u].append(p)
for u, ps in dct.items():
print(" ", u, "in")
for p in ps:
print(" ", p)


if __name__ == "__main__":
    # Usage: python tools/divergent_links.py <path-to-built-html-folder>
    checker = Checker()
    for file in find_html_files(sys.argv[1]):
        with open(file) as f:
            checker.scan(f.read(), file)
    checker.duplicates()
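
A quick, hypothetical smoke test (not part of the PR) can exercise `Checker` with two pages whose link text is identical but whose targets differ, which is exactly the case the script reports:

```python
# Hypothetical smoke test; assumes divergent_links.py is importable.
from divergent_links import Checker

page_one = '<a href="https://example.com/install.html">Install</a>'
page_two = '<a href="https://example.com/setup.html">Install</a>'

checker = Checker()
checker.scan(page_one, "page_one.html")
checker.scan(page_two, "page_two.html")
# "install" now maps to two different URLs, so a report is printed.
checker.duplicates()
```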