mahendra867 changed the title from "[Bug]Keep reading the the pdf again & again when i use PDFImageReader even when the pdf file already exist in database" to "[Bug]Keep reading the the pdf file again & again when i use PDFImageReader even when the pdf file already exist in database" on Feb 5, 2025
Description
Briefly describe the issue you’re experiencing or the bug you’ve found.
code
import os
from dotenv import load_dotenv
from agno.agent import Agent
from agno.embedder.azure_openai import AzureOpenAIEmbedder
from agno.knowledge.pdf import PDFKnowledgeBase, PDFImageReader, PDFReader
from agno.vectordb.pgvector import PgVector, SearchType
from agno.models.openai import OpenAIChat
from agno.storage.agent.postgres import PostgresAgentStorage
from sqlalchemy import create_engine, inspect, text
from agno.vectordb.pgvector.index import Ivfflat, HNSW
from agno.embedder.openai import OpenAIEmbedder
#from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_text_splitters import RecursiveCharacterTextSplitter
from agno.document.chunking.recursive import RecursiveChunking
# Load environment variables
load_dotenv()
print("Environment variables loaded.")
# Fetch API keys and endpoint from environment variables
AZURE_OPENAI_API_KEY = os.getenv("AZURE_OPENAI_API_KEY")
AZURE_OPENAI_ENDPOINT = os.getenv("AZURE_OPENAI_ENDPOINT")
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
db_url = "postgresql+psycopg://ai:ai@localhost:5532/ai"
# Function to check if the table exists
def check_table_exists(engine, schema, table_name):
    print(f"Checking if table '{table_name}' exists in schema '{schema}'...")
    inspector = inspect(engine)
    exists = inspector.has_table(table_name, schema=schema)
    print(f"Table '{table_name}' exists: {exists}")
    return exists
# Function to check if a PDF file is already in the database
def is_pdf_in_db(engine, schema, table_name, pdf_name):
    pdf_name_no_ext = os.path.splitext(pdf_name)[0]  # Remove the .pdf extension
    print(f"Checking if PDF '{pdf_name_no_ext}' exists in table '{table_name}'...")
    with engine.connect() as connection:
        result = connection.execute(
            text(f"SELECT 1 FROM {schema}.{table_name} WHERE name = :name"),
            {"name": pdf_name_no_ext}
        )
        exists = result.fetchone() is not None
    print(f"PDF '{pdf_name_no_ext}' exists in table '{table_name}': {exists}")
    return exists
# Set up the PDF knowledge base with vector database
print("Setting up PDF knowledge base with vector database...")
pdf_knowledge_base = PDFKnowledgeBase(
    # Raw string so the backslashes in the Windows path are not read as escape sequences (e.g. \a)
    path=r"D:\Projects\agentic_new_rag\pdfs",
    vector_db=PgVector(
        table_name="updated_rag009",
        schema='ai',
        db_url=db_url,
        search_type=SearchType.hybrid,
        vector_index=HNSW(),
        embedder=OpenAIEmbedder(
            api_key=OPENAI_API_KEY,
            id="text-embedding-ada-002",
            dimensions=1536,
            encoding_format="float"
        )
    ),
    reader=PDFImageReader(chunk=False),  # Use a default reader
    chunking_strategy=RecursiveChunking(chunk_size=4000, overlap=800),
    documents=3,
)
#chunking_strategy=RecursiveChunking(),
# Define the PgAgentStorage with connection to database
# Create a SQLAlchemy engine
engine = create_engine(db_url)
# Before loading, check if the table exists and create if not
if not check_table_exists(engine, "ai", "updated_rag009"):
    print("Table does not exist. Creating the table...")
    pdf_knowledge_base.load(recreate=True, upsert=True)  # Create the table
else:
    print("Table exists. Skipping table creation...")
    pdf_knowledge_base.load(recreate=False, skip_existing=True)  # Skip existing table
# Check if PDFs are already in the database and process accordingly
# Raw string so the backslashes in the Windows path are not read as escape sequences
pdf_folder = r"D:\Projects\agentic_new_rag\pdfs"
for pdf_file in os.listdir(pdf_folder):
    if pdf_file.endswith(".pdf"):
        if not is_pdf_in_db(engine, "ai", "updated_rag009", pdf_file):
            print(f"Processing {pdf_file}...")
            # Process the PDF file with PDFImageReader
            pdf_reader = PDFImageReader(chunk=True)
            pdf_reader.read(os.path.join(pdf_folder, pdf_file))
        else:
            print(f"Skipping {pdf_file}, already in database.")
# Initialize the RAG agent
print("Initializing RAG Agent...")
rag_agent = Agent(
    name="Agentic RAG Application",
    agent_id="rag-agent",
    model=OpenAIChat(id="gpt-4o-mini"),
    knowledge=pdf_knowledge_base,
    add_context=True,
    search_knowledge=True,
    read_chat_history=True,
    debug_mode=True,
    #storage=storage,
    description=(
        "You are an intelligent retrieval assistant specialized in utilizing knowledge stored in "
        "a curated set of documents related"
    ),
    instructions=[
        "you have access of the documents and their corresponding file names are: "
    ],
    markdown=True
)
print("RAG Agent initialized.")
# Print the agent's response to a query
rag_agent.print_response("give me full Table 1. Volume 1 of DoDM 5200.01 Cancellation Actions of 520045m", stream=True)
Steps to Reproduce
List the steps needed to encounter this bug or issue.
Agent Configuration (if applicable)
Provide relevant agent configuration.
This is the Agentic RAG configuration I am using.
Expected Behavior
What did you expect to happen?
Based on the code above, PDFImageReader should skip reading 520045m.pdf because that PDF is already present in pgvector. Instead, it keeps reading the file every time I run the code.
When I use PDFReader, the PDF is skipped once 520045m is already present in pgvector.
The issue is with PDFImageReader: it does not skip reading PDF files that are already stored.
Please work on this issue.
Actual Behavior
What actually happened instead?
The PDF file should be skipped when its name already exists in the table, but PDFImageReader still reads it.
Screenshots or Logs (if applicable)
Include any relevant screenshots or error logs that demonstrate the issue.
Logs of PDFReader
These are the logs for PDFReader. Since 520045m is already present in the table, reading was skipped.
PS D:\Projects\agentic_new_rag> & C:/Users/wel/anaconda3/envs/venv/python.exe "d:/Projects/agentic_new_rag/final_Agentic_rag_UI.py"
Environment variables loaded.
Setting up PDF knowledge base with vector database...
PgAgentStorage initialized.
Checking if table 'updated_rag009' exists in schema 'ai'...
Table 'updated_rag009' exists: True
Table exists. Skipping table creation...
INFO Creating collection
INFO Loading knowledge base
INFO Reading: 520045m
INFO Added 0 documents to knowledge base
INFO Reading: CMMC101
INFO Added 0 documents to knowledge base
Checking if PDF '520045m' exists in table 'updated_rag009'...
PDF '520045m' exists in table 'updated_rag009': True
Skipping 520045m.pdf, already in database.
Checking if PDF 'CMMC101' exists in table 'updated_rag009'...
PDF 'CMMC101' exists in table 'updated_rag009': True
Skipping CMMC101.pdf, already in database.
Initializing RAG Agent...
RAG Agent initialized.
Logs of PDFImageReader, 1st run
Environment variables loaded.
Setting up PDF knowledge base with vector database...
PgAgentStorage initialized.
Checking if table 'updated_rag0990' exists in schema 'ai'...
Table 'updated_rag0990' exists: False
Table does not exist. Creating the table...
INFO Dropping collection
INFO Table 'ai.updated_rag0990' does not exist.
INFO Creating collection
INFO Loading knowledge base
INFO Reading: 520045m
INFO Upserted batch of 68 documents.
INFO Added 68 documents to knowledge base
INFO Reading: CMMC101
INFO Upserted batch of 21 documents.
INFO Added 21 documents to knowledge base
Checking if PDF '520045m' exists in table 'updated_rag0990'...
PDF '520045m' exists in table 'updated_rag0990': True
Skipping 520045m.pdf, already in database.
Checking if PDF 'CMMC101' exists in table 'updated_rag0990'...
PDF 'CMMC101' exists in table 'updated_rag0990': True
Skipping CMMC101.pdf, already in database.
Initializing RAG Agent...
RAG Agent initialized.
Logs of PDFImageReader, 2nd run of the same code with the same PDF files. No documents were added to pgvector because the PDF files were already present, but PDFImageReader still took a long time reading them first. PDFReader, by contrast, neither read the PDF files nor added documents.
Environment variables loaded.
Setting up PDF knowledge base with vector database...
PgAgentStorage initialized.
Checking if table 'updated_rag0990' exists in schema 'ai'...
Table 'updated_rag0990' exists: True
Table exists. Skipping table creation...
INFO Creating collection
INFO Loading knowledge base
INFO Reading: 520045m
INFO Added 0 documents to knowledge base
INFO Reading: CMMC101
INFO Added 0 documents to knowledge base
Checking if PDF '520045m' exists in table 'updated_rag0990'...
PDF '520045m' exists in table 'updated_rag0990': True
Skipping 520045m.pdf, already in database.
Checking if PDF 'CMMC101' exists in table 'updated_rag0990'...
PDF 'CMMC101' exists in table 'updated_rag0990': True
Skipping CMMC101.pdf, already in database.
Initializing RAG Agent...
RAG Agent initialized.
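In the second-run log, "Reading: 520045m" is followed by "Added 0 documents", so the slow part appears to happen before the existing-document check. One way to confirm where the time goes is to time each step of the script separately. The snippet below is only a sketch using the standard library; the `timed` helper and the `timings` dictionary are hypothetical names, and `time.sleep` stands in for the real `pdf_knowledge_base.load(...)` call.

```python
import time
from contextlib import contextmanager
from typing import Dict, Iterator

# Collected wall-clock durations in seconds, keyed by step label.
timings: Dict[str, float] = {}


@contextmanager
def timed(label: str) -> Iterator[None]:
    """Record how long the wrapped block takes under the given label."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[label] = time.perf_counter() - start


if __name__ == "__main__":
    with timed("load with skip_existing"):
        time.sleep(0.01)  # stand-in for pdf_knowledge_base.load(recreate=False, skip_existing=True)
    for label, seconds in timings.items():
        print(f"{label}: {seconds:.3f}s")
```

Running the same script once with each reader and comparing the recorded durations would show whether the extra time is spent inside the load step.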
Environment
Possible Solutions (optional)
Suggest any ideas you might have to fix or address the issue.
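A possible workaround until the reader itself skips known files is to filter the folder listing before any expensive read is started. The following is a minimal sketch, not the library's own mechanism: `pdfs_needing_ingest` is a hypothetical helper, and it assumes the `name` column holds the file name without the `.pdf` extension, as the `is_pdf_in_db` check in the script does.

```python
from pathlib import Path
from typing import Iterable, List, Set


def pdfs_needing_ingest(pdf_paths: Iterable[Path], indexed_names: Set[str]) -> List[Path]:
    """Return only the PDFs whose stem (file name without .pdf) is not indexed yet."""
    pending: List[Path] = []
    for path in pdf_paths:
        if path.suffix.lower() != ".pdf":
            continue  # ignore non-PDF files
        if path.stem in indexed_names:
            continue  # already in the vector table, so skip the expensive read
        pending.append(path)
    return pending


if __name__ == "__main__":
    # Hypothetical data: in the real script, indexed_names would be built from
    # the distinct values of the name column in the vector table.
    indexed = {"520045m"}
    files = [Path("520045m.pdf"), Path("CMMC101.pdf"), Path("notes.txt")]
    for pending_pdf in pdfs_needing_ingest(files, indexed):
        print(f"Would read: {pending_pdf}")
```

Only the returned paths would then be handed to the image reader. How to pass a subset of files to the knowledge base depends on the agno version in use and is not shown here.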
Additional Context
Add any other context or details about the problem here.
Please solve this issue as quickly as possible.