TypeError: Implicit conversion to a NumPy array is not allowed. Please use `.get()` to construct a NumPy array explicitly. #744

chrisammon3000 · 2022-09-26T19:54:02Z

I'm working with BERTopic in a GPU Colab notebook running Python 3.7.14 and trying to perform topic modeling on a document consisting of a single string 11,412 characters long:

>>> print(transcript)
Hello friends, it's me today. We're checking out some cool things that I learned on tik-tok I do learn a lot of things on tik-tok how to remove a weed on green green is golf grass It's fake grass, right? Is it golf grass? Wait, I'm starting to think that's real grass So you literally just cut out a hole remove the entire weed and then put it back in like it's a piece of cake That mama said you can't eat yet wipe away the crumbs wipe away the evidence That's really how they do it. ...

>>> len(transcript)
11412

Code:

import spacy
from bertopic import BERTopic

spacy.require_gpu()
nlp = spacy.load("en_core_web_lg", exclude=['tagger', 'parser', 'ner', 'attribute_ruler', 'lemmatizer'])

topic_model = BERTopic(embedding_model=nlp)

# passing input as an iterable
topics, probabilities = topic_model.fit_transform([transcript])

Error result:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
[<ipython-input-49-6159c101b53a>](https://localhost:8080/#) in <module>
----> 1 topics, probabilities = topic_model.fit_transform([transcript])

3 frames
[/usr/local/lib/python3.7/dist-packages/bertopic/backend/_spacy.py](https://localhost:8080/#) in embed(self, documents, verbose)
     95                     vector = self.embedding_model("An empty document").vector
     96                 embeddings.append(vector)
---> 97             embeddings = np.array(embeddings)
     98 
     99         return embeddings

cupy/_core/core.pyx in cupy._core.core.ndarray.__array__()

TypeError: Implicit conversion to a NumPy array is not allowed. Please use `.get()` to construct a NumPy array explicitly.

When I split the document into groups of three sentences, I get the same error:

sentences = transcript.split(". ")
# first five groups of three sentences each
docs = [". ".join(sentences[i:i + 3]) for i in range(0, 15, 3)]

topics, probabilities = topic_model.fit_transform(docs)

TypeError: Implicit conversion to a NumPy array is not allowed. Please use `.get()` to construct a NumPy array explicitly.

I also get the same error with the data from the quickstart tutorial:

from sklearn.datasets import fetch_20newsgroups

docs = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))['data']
topic_model.fit_transform(docs[:10])

Could an incompatibility in the environment, such as the Python version or the version of another package like NumPy or CuPy, be causing this? Or am I using it incorrectly?

MaartenGr · 2022-09-27T09:01:34Z

I believe spaCy has changed the way it handles empty documents when creating vectors, which BERTopic did not account for. I'll have to do some more research to see if the issue can be handled better.

metasyn · 2023-04-03T21:18:45Z

I see you referenced a fix in this commit: a7927a2

But looking at what's on master, I don't see the fix there. Am I looking in the wrong place?
https://github.com/MaartenGr/BERTopic/blob/master/bertopic/backend/_spacy.py#L80-L92

I ask because I ran into this error today on version 0.14.1.

MaartenGr · 2023-04-04T09:10:44Z

@metasyn Could you create a reproducible example with 0.14.1? That way, it becomes a bit easier to see what exactly is happening here.

metasyn · 2023-04-04T17:14:45Z

Totally, I should've done that initially.

import sys
from typing import List

import bertopic
import cupy
import en_core_web_lg
import spacy

# This is required to ensure we're using CuPy/CUDA/GPUs
spacy.require_gpu()


def log_versions():
    print(f"python version: {sys.version}")
    print(f"bertopic version: {bertopic.__version__}")
    print(f"spacy version: {spacy.__version__}")
    print(f"en_core_web_lg - spacy model version: {en_core_web_lg.__version__}")
    print(f"CUDA 11.7 - cupy version: {cupy.__version__}")


def get_sample_input():
    """From https://en.wikipedia.org/wiki/Golden_Gate_Bridge."""
    return """
        The Golden Gate Bridge is a suspension bridge spanning the Golden Gate, the
        one-mile-wide (1.6 km) strait connecting San Francisco Bay and the Pacific
        Ocean. The structure links the U.S. city of San Francisco, California—the
        northern tip of the San Francisco Peninsula—to Marin County, carrying both U.S.
        Route 101 and California State Route 1 across the strait. It also carries
        pedestrian and bicycle traffic, and is designated as part of U.S. Bicycle Route
        95. Recognized by the American Society of Civil Engineers as one of the Wonders
        of the Modern World,[7] the bridge is one of the most internationally
        recognized symbols of San Francisco and California.

        The idea of a fixed link between San Francisco and Marin had gained increasing
        popularity during the late 19th century, but it was not until the early 20th
        century that such a link became feasible. Joseph Strauss served as chief
        engineer for the project, with Leon Moisseiff, Irving Morrow and Charles Ellis
        making significant contributions to its design. The bridge opened to the public
        in 1937 and has undergone various retrofits and other improvement projects in
        the decades since.

        The Golden Gate Bridge is described in Frommer's travel guide as "possibly the
        most beautiful, certainly the most photographed, bridge in the world."[8][9] At
        the time of its opening in 1937, it was both the longest and the tallest
        suspension bridge in the world, titles it held until 1964 and 1998
        respectively. Its main span is 4,200 feet (1,280 m) and total height is 746
        feet (227 m).[10]

    """


def filter_func(doc: spacy.tokens.Doc) -> List[str]:
    return [
        token.lemma_.lower()
        for token in doc
        if len(token.text) > 2  # acronyms, typos
        and not token.is_stop  # stop words
        and not token.is_punct  # punctuation
    ]


def get_word_lists(nlp: spacy.Language, text: str) -> List[str]:
    return [" ".join(filter_func(s.as_doc())) for s in nlp(text).sents]


def repro():
    nlp = spacy.load("en_core_web_lg")
    text = get_sample_input()
    word_lists = get_word_lists(nlp, text)
    print(word_lists)

    # This is fine
    topic_model = bertopic.BERTopic(embedding_model=nlp)

    # The next line errors
    topics, _ = topic_model.fit_transform(word_lists)
    print(topics)


if __name__ == "__main__":
    log_versions()
    repro()

Gives me:

python version: 3.10.10 (main, Apr  3 2023, 08:04:30) [GCC 11.3.0]
bertopic version: 0.14.1
spacy version: 3.4.4
en_core_web_lg - spacy model version: 3.4.1
CUDA 11.7 - cupy version: 10.6.0
['\n         golden gate bridge suspension bridge span golden gate \n         mile wide 1.6 strait connect san francisco bay pacific \n         ocean', 'structure link u.s. city san francisco california \n         northern tip san francisco peninsula marin county carry u.s. \n         route 101 california state route strait', 'carry \n         pedestrian bicycle traffic designate u.s. bicycle route \n        ', 'recognize american society civil engineers wonders \n         modern world,[7 bridge internationally \n         recognize symbol san francisco california \n\n        ', 'idea fix link san francisco marin gain increase \n         popularity late 19th century early 20th \n         century link feasible', 'joseph strauss serve chief \n         engineer project leon moisseiff irving morrow charles ellis \n         make significant contribution design', 'bridge open public \n         1937 undergo retrofit improvement project \n         decade \n\n        ', 'golden gate bridge describe frommer travel guide possibly \n         beautiful certainly photograph bridge world "[8][9', '\n         time opening 1937 long tall \n         suspension bridge world title hold 1964 1998 \n         respectively', 'main span 4,200 foot 1,280 total height 746 \n         foot 227 m).[10 \n\n    ']
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[2], line 80
     78 if __name__ == "__main__":
     79     log_versions()
---> 80     repro()

Cell In[2], line 74, in repro()
     71 topic_model = bertopic.BERTopic(embedding_model=nlp)
     73 # The next line errors
---> 74 topics, _ = topic_model.fit_transform(word_lists)
     75 print(topics)

File /usr/local/lib/python3.10/site-packages/bertopic/_bertopic.py:344, in BERTopic.fit_transform(self, documents, embeddings, y)
    341 if embeddings is None:
    342     self.embedding_model = select_backend(self.embedding_model,
    343                                           language=self.language)
--> 344     embeddings = self._extract_embeddings(documents.Document,
    345                                           method="document",
    346                                           verbose=self.verbose)
    347     logger.info("Transformed documents to Embeddings")
    348 else:

File /usr/local/lib/python3.10/site-packages/bertopic/_bertopic.py:2828, in BERTopic._extract_embeddings(self, documents, method, verbose)
   2826     embeddings = self.embedding_model.embed_words(documents, verbose)
   2827 elif method == "document":
-> 2828     embeddings = self.embedding_model.embed_documents(documents, verbose)
   2829 else:
   2830     raise ValueError("Wrong method for extracting document/word embeddings. "
   2831                      "Either choose 'word' or 'document' as the method. ")

File /usr/local/lib/python3.10/site-packages/bertopic/backend/_base.py:69, in BaseEmbedder.embed_documents(self, document, verbose)
     55 def embed_documents(self,
     56                     document: List[str],
     57                     verbose: bool = False) -> np.ndarray:
     58     """ Embed a list of n words into an n-dimensional
     59     matrix of embeddings
     60
   (...)
     67         that each have an embeddings size of `m`
     68     """
---> 69     return self.embed(document, verbose)

File /usr/local/lib/python3.10/site-packages/bertopic/backend/_spacy.py:92, in SpacyBackend.embed(self, documents, verbose)
     90     for doc in tqdm(documents, position=0, leave=True, disable=not verbose):
     91         embeddings.append(self.embedding_model(doc or empty_document).vector)
---> 92     embeddings = np.array(embeddings)
     94 return embeddings

File cupy/_core/core.pyx:1397, in cupy._core.core.ndarray.__array__()

TypeError: Implicit conversion to a NumPy array is not allowed. Please use `.get()` to construct a NumPy array explicitly.

Are there additional details I can provide?

best,
xander

metasyn · 2023-04-04T18:44:34Z

Oh, I realized I can simplify that a bit, here is a more minimal repro:

import sys

import bertopic
import cupy
import en_core_web_lg
import spacy

# This is required to ensure we're using CuPy/CUDA/GPUs
spacy.require_gpu()


def log_versions():
    print(f"python version: {sys.version}")
    print(f"bertopic version: {bertopic.__version__}")
    print(f"spacy version: {spacy.__version__}")
    print(f"en_core_web_lg - spacy model version: {en_core_web_lg.__version__}")
    print(f"CUDA 11.7 - cupy version: {cupy.__version__}")


def get_word_lists():
    """From https://en.wikipedia.org/wiki/Golden_Gate_Bridge."""
    return """
        The Golden Gate Bridge is a suspension bridge spanning the Golden Gate, the
        one-mile-wide (1.6 km) strait connecting San Francisco Bay and the Pacific
        Ocean. The structure links the U.S. city of San Francisco, California—the
        northern tip of the San Francisco Peninsula—to Marin County, carrying both U.S.
        Route 101 and California State Route 1 across the strait. It also carries
        pedestrian and bicycle traffic, and is designated as part of U.S. Bicycle Route
        95. Recognized by the American Society of Civil Engineers as one of the Wonders
        of the Modern World,[7] the bridge is one of the most internationally
        recognized symbols of San Francisco and California.

        The idea of a fixed link between San Francisco and Marin had gained increasing
        popularity during the late 19th century, but it was not until the early 20th
        century that such a link became feasible. Joseph Strauss served as chief
        engineer for the project, with Leon Moisseiff, Irving Morrow and Charles Ellis
        making significant contributions to its design. The bridge opened to the public
        in 1937 and has undergone various retrofits and other improvement projects in
        the decades since.

        The Golden Gate Bridge is described in Frommer's travel guide as "possibly the
        most beautiful, certainly the most photographed, bridge in the world."[8][9] At
        the time of its opening in 1937, it was both the longest and the tallest
        suspension bridge in the world, titles it held until 1964 and 1998
        respectively. Its main span is 4,200 feet (1,280 m) and total height is 746
        feet (227 m).[10]

    """.split()


def repro():
    nlp = spacy.load("en_core_web_lg")
    word_lists = get_word_lists()
    print(word_lists)

    # This is fine
    topic_model = bertopic.BERTopic(embedding_model=nlp)

    # The next line errors
    topics, _ = topic_model.fit_transform(word_lists)
    print(topics)


if __name__ == "__main__":
    log_versions()
    repro()

MaartenGr · 2023-04-06T06:14:10Z

I am not getting the error when I run your code on a CPU. I believe that en_core_web_lg is actually a CPU-optimized model, which might explain the error you are getting.
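
One possible user-side workaround, offered as an untested assumption and not a confirmed fix: compute the document vectors yourself, copy any GPU-backed array to host memory with `.get()`, and pass the result to `fit_transform` through its `embeddings` parameter (visible in the traceback above), so the spaCy backend's `np.array(...)` call is skipped. `embed_on_host` is a hypothetical helper name, not part of BERTopic or spaCy, and this assumes nothing else in the pipeline calls the embedding backend.

```python
import numpy as np


def embed_on_host(nlp, documents):
    """Return an (n_docs, dim) NumPy matrix of document vectors.

    `nlp` is any callable that returns an object with a `.vector`
    attribute (for example a loaded spaCy pipeline). Assumes at least
    one document is given.
    """
    rows = []
    for text in documents:
        vector = nlp(text).vector
        # GPU-backed arrays (e.g. CuPy) expose .get() for an explicit
        # device-to-host copy; plain NumPy arrays do not need it.
        if not isinstance(vector, np.ndarray) and hasattr(vector, "get"):
            vector = vector.get()
        rows.append(np.asarray(vector))
    return np.vstack(rows)


# Hypothetical usage with the objects from the repro above:
# embeddings = embed_on_host(nlp, word_lists)
# topics, _ = topic_model.fit_transform(word_lists, embeddings=embeddings)
```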

metasyn · 2023-04-10T18:25:04Z

I am also not getting the error when running on a CPU. It seems you had this fix in earlier:

a7927a2#diff-06119c27943e751ff191ded5f03370df0e9e55afa3aeab96b8a2588ccb1cb6a0R97-R101

Is this an approach we could pursue?

MaartenGr · 2023-04-11T05:53:43Z

@metasyn Yeah, I think that should solve the issue. It's strange, though; something probably went wrong when merging branches there. If you have the time and want to do a PR, that would be greatly appreciated. Otherwise, I might have some time in the coming weeks to look at this.
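
The kind of guard under discussion might look roughly like the sketch below. It is an illustration only, not the code from a7927a2 or #1179: `FakeDeviceArray`, `to_numpy`, and `stack_vectors` are hypothetical names, and the stand-in class takes the place of a real `cupy.ndarray` so the example runs without a GPU.

```python
import numpy as np


class FakeDeviceArray:
    """Hypothetical stand-in for a GPU array such as cupy.ndarray."""

    def __init__(self, values):
        self._values = list(values)

    def __array__(self, dtype=None, copy=None):
        # Mimics the refusal to convert implicitly seen in the traceback.
        raise TypeError("Implicit conversion to a NumPy array is not allowed.")

    def get(self):
        # Mimics an explicit device-to-host copy.
        return np.asarray(self._values, dtype=np.float32)


def to_numpy(vector):
    """Return `vector` as a NumPy array, using .get() when it is not one already."""
    if not isinstance(vector, np.ndarray) and hasattr(vector, "get"):
        return vector.get()
    return np.asarray(vector)


def stack_vectors(vectors):
    """Convert each vector to host memory first, then stack into one matrix."""
    return np.array([to_numpy(v) for v in vectors])
```

Converting each vector before the final `np.array(...)` call keeps the CPU path unchanged, since NumPy arrays pass straight through `to_numpy`.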

metasyn · 2023-04-11T19:00:26Z

Sounds good. I've opened a PR here: #1179

MaartenGr added a commit that referenced this issue Nov 29, 2022

Fix #744

a7927a2

MaartenGr mentioned this issue Nov 29, 2022

v0.13 #840

Merged

MaartenGr closed this as completed Jan 9, 2023

metasyn mentioned this issue Apr 11, 2023

[bugfix] add additional logic to handle cupy arrays - fixes #744 #1179

Merged
