TypeError: Implicit conversion to a NumPy array is not allowed. Please use .get()
to construct a NumPy array explicitly.
#744
Comments
I believe spaCy has changed the way it handles empty documents when creating vectors, and BERTopic does not account for that change. I'll have to do some more research to see whether the issue can be handled better.
I see you referenced a fix in this commit: a7927a2. But looking at what's on master, I don't see the fix there. Am I looking in the wrong place? I ask because I ran into this error today on version 0.14.1.
@metasyn Could you create a reproducible example with 0.14.1? That way, it becomes a bit easier to see what exactly is happening here.
Totally, I should've done that initially.

import sys
from typing import List

import bertopic
import cupy
import en_core_web_lg
import spacy

# This is required to ensure we're using cupy/CUDA/GPUs
spacy.require_gpu()


def log_versions():
    print(f"python version: {sys.version}")
    print(f"bertopic version: {bertopic.__version__}")
    print(f"spacy version: {spacy.__version__}")
    print(f"en_core_web_lg - spacy model version: {en_core_web_lg.__version__}")
    print(f"CUDA 11.7 - cupy version: {cupy.__version__}")


def get_sample_input():
    """From https://en.wikipedia.org/wiki/Golden_Gate_Bridge."""
    return """
The Golden Gate Bridge is a suspension bridge spanning the Golden Gate, the
one-mile-wide (1.6 km) strait connecting San Francisco Bay and the Pacific
Ocean. The structure links the U.S. city of San Francisco, California—the
northern tip of the San Francisco Peninsula—to Marin County, carrying both U.S.
Route 101 and California State Route 1 across the strait. It also carries
pedestrian and bicycle traffic, and is designated as part of U.S. Bicycle Route
95. Recognized by the American Society of Civil Engineers as one of the Wonders
of the Modern World,[7] the bridge is one of the most internationally
recognized symbols of San Francisco and California.
The idea of a fixed link between San Francisco and Marin had gained increasing
popularity during the late 19th century, but it was not until the early 20th
century that such a link became feasible. Joseph Strauss served as chief
engineer for the project, with Leon Moisseiff, Irving Morrow and Charles Ellis
making significant contributions to its design. The bridge opened to the public
in 1937 and has undergone various retrofits and other improvement projects in
the decades since.
The Golden Gate Bridge is described in Frommer's travel guide as "possibly the
most beautiful, certainly the most photographed, bridge in the world."[8][9] At
the time of its opening in 1937, it was both the longest and the tallest
suspension bridge in the world, titles it held until 1964 and 1998
respectively. Its main span is 4,200 feet (1,280 m) and total height is 746
feet (227 m).[10]
"""


def filter_func(doc: spacy.tokens.Doc) -> List[str]:
    return [
        token.lemma_.lower()
        for token in doc
        if len(token.text) > 2  # acronyms, typos
        and not token.is_stop  # stop words
        and not token.is_punct  # punctuation
    ]


def get_word_lists(nlp: spacy.Language, text: str) -> List[str]:
    return [" ".join(filter_func(s.as_doc())) for s in nlp(text).sents]


def repro():
    nlp = spacy.load("en_core_web_lg")
    text = get_sample_input()
    word_lists = get_word_lists(nlp, text)
    print(word_lists)
    # This is fine
    topic_model = bertopic.BERTopic(embedding_model=nlp)
    # The next line errors
    topics, _ = topic_model.fit_transform(word_lists)
    print(topics)


if __name__ == "__main__":
    log_versions()
    repro()

Gives me:
Are there additional details I can provide? best,
Oh, I realized I can simplify that a bit. Here is a more minimal repro:

import sys

import bertopic
import cupy
import en_core_web_lg
import spacy

# This is required to ensure we're using cupy/CUDA/GPUs
spacy.require_gpu()


def log_versions():
    print(f"python version: {sys.version}")
    print(f"bertopic version: {bertopic.__version__}")
    print(f"spacy version: {spacy.__version__}")
    print(f"en_core_web_lg - spacy model version: {en_core_web_lg.__version__}")
    print(f"CUDA 11.7 - cupy version: {cupy.__version__}")


def get_word_lists():
    """From https://en.wikipedia.org/wiki/Golden_Gate_Bridge."""
    return """
The Golden Gate Bridge is a suspension bridge spanning the Golden Gate, the
one-mile-wide (1.6 km) strait connecting San Francisco Bay and the Pacific
Ocean. The structure links the U.S. city of San Francisco, California—the
northern tip of the San Francisco Peninsula—to Marin County, carrying both U.S.
Route 101 and California State Route 1 across the strait. It also carries
pedestrian and bicycle traffic, and is designated as part of U.S. Bicycle Route
95. Recognized by the American Society of Civil Engineers as one of the Wonders
of the Modern World,[7] the bridge is one of the most internationally
recognized symbols of San Francisco and California.
The idea of a fixed link between San Francisco and Marin had gained increasing
popularity during the late 19th century, but it was not until the early 20th
century that such a link became feasible. Joseph Strauss served as chief
engineer for the project, with Leon Moisseiff, Irving Morrow and Charles Ellis
making significant contributions to its design. The bridge opened to the public
in 1937 and has undergone various retrofits and other improvement projects in
the decades since.
The Golden Gate Bridge is described in Frommer's travel guide as "possibly the
most beautiful, certainly the most photographed, bridge in the world."[8][9] At
the time of its opening in 1937, it was both the longest and the tallest
suspension bridge in the world, titles it held until 1964 and 1998
respectively. Its main span is 4,200 feet (1,280 m) and total height is 746
feet (227 m).[10]
""".split()


def repro():
    nlp = spacy.load("en_core_web_lg")
    word_lists = get_word_lists()
    print(word_lists)
    # This is fine
    topic_model = bertopic.BERTopic(embedding_model=nlp)
    # The next line errors
    topics, _ = topic_model.fit_transform(word_lists)
    print(topics)


if __name__ == "__main__":
    log_versions()
    repro()
I am not getting the error when I run your code on a CPU. I believe that
I am also not getting the error when running on a CPU. It seems you had this fix in earlier: a7927a2#diff-06119c27943e751ff191ded5f03370df0e9e55afa3aeab96b8a2588ccb1cb6a0R97-R101 Is this an approach we could pursue?
@metasyn Yeah, I think that should solve the issue. It's strange, though; I think something went wrong when merging branches there. If you have the time and want to do a PR, that would be greatly appreciated. Otherwise, I might have some time in the coming weeks to look at this.
Sounds good: I've opened a PR here: #1179
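The commit referenced above is not reproduced in this thread, so the following is only a minimal sketch of the kind of conversion the error message itself asks for: call `.get()` to copy a GPU-backed array to host memory before handing it to NumPy. The helper name `to_host` is hypothetical and is not part of BERTopic or spaCy. It assumes that GPU vectors are CuPy arrays (which expose `.get()`) and that CPU vectors are NumPy arrays (which do not), and it uses duck typing so that neither library has to be imported.

```python
def to_host(vector):
    """Return a host-memory array for `vector`.

    Assumption: a GPU-backed vector is a cupy.ndarray, whose .get() method
    copies the data into a numpy.ndarray. NumPy arrays have no .get()
    method, so CPU vectors are returned unchanged.
    """
    get = getattr(vector, "get", None)
    if callable(get):
        return get()  # explicit device-to-host copy
    return vector
```

With a helper like this, an embedding backend could presumably build its matrix as `np.array([to_host(nlp(doc).vector) for doc in documents])`, which should take the same code path on CPU and GPU. Whether PR #1179 does exactly this is not stated in the thread.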
I'm working with BERTopic in a GPU Colab notebook running Python 3.7.14, trying to perform topic modeling on a document consisting of a single string 11412 characters long:
Code:
Error result:
When I break the document up by groups of three sentences, I get the same error:
I also get the same error with the quickstart tutorial:
Is there some incompatibility in the environment, Python version or another package version like NumPy or CuPy that could be causing this? Or am I using it incorrectly?
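When asking about a possible environment mismatch, it usually helps to attach the installed versions of the packages involved. The sketch below is one way to collect them using only the standard library; the function name `report_versions` is hypothetical. It assumes Python 3.8 or newer for `importlib.metadata` (on Python 3.7 the `importlib_metadata` backport would be needed instead), and it assumes the default distribution names listed in the signature. CuPy in particular is often installed under a CUDA-specific name such as `cupy-cuda11x`, in which case `cupy` would be reported as not found.

```python
import sys
from importlib import metadata


def report_versions(packages=("bertopic", "spacy", "numpy", "cupy")):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {"python": sys.version.split()[0]}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Not installed under this exact distribution name.
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in report_versions().items():
        print(f"{name}: {version or 'not found'}")
```

Pasting this output alongside the traceback makes it easier to tell whether the error depends on a GPU-enabled spaCy setup or on a particular package version.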