MathpixPDFLoader

class langchain_community.document_loaders.pdf.MathpixPDFLoader(file_path: str, processed_file_format: str = 'md', max_wait_time_seconds: int = 500, should_clean_pdf: bool = False, extra_request_data: Dict[str, Any] | None = None, **kwargs: Any)

Load PDF files using Mathpix service.

Initialize with a file path.

Parameters:
  • file_path (str) – path to the file to load.

  • processed_file_format (str) – format of the processed file. Default is “md”.

  • max_wait_time_seconds (int) – maximum time, in seconds, to wait for a response from the server. Default is 500.

  • should_clean_pdf (bool) – whether to clean the PDF file. Default is False.

  • extra_request_data (Dict[str, Any] | None) – additional request data.

  • **kwargs (Any) – additional keyword arguments.
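
A minimal usage sketch based on the signature above. It assumes that `langchain_community` is installed and that Mathpix credentials are available to the loader (commonly through the `MATHPIX_API_KEY` and `MATHPIX_API_ID` environment variables; this is an assumption, since this page does not say how credentials are supplied). `build_loader_kwargs` and `load_with_mathpix` are hypothetical helper names, and the import is deferred so the helpers can be defined without the package present.

```python
from typing import Any, Dict, List


def build_loader_kwargs(
    file_path: str, max_wait_time_seconds: int = 500
) -> Dict[str, Any]:
    """Collect constructor arguments using the documented parameter names."""
    return {
        "file_path": file_path,
        "processed_file_format": "md",  # documented default
        "max_wait_time_seconds": max_wait_time_seconds,
        "should_clean_pdf": True,  # documented default is False
    }


def load_with_mathpix(file_path: str) -> List[Any]:
    """Send the PDF to Mathpix and return the resulting Document objects."""
    # Deferred import: langchain_community is only needed at call time.
    from langchain_community.document_loaders.pdf import MathpixPDFLoader

    loader = MathpixPDFLoader(**build_loader_kwargs(file_path))
    return loader.load()
```

Calling `load_with_mathpix("paper.pdf")` performs network requests, so it only works with valid credentials and a reachable Mathpix service.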

Attributes

data

source

url

Methods

__init__(file_path[, processed_file_format, ...])

Initialize with a file path.

alazy_load()

A lazy loader for Documents.

aload()

Load data into Document objects.

clean_pdf(contents)

Clean the PDF file.

get_processed_pdf(pdf_id)

lazy_load()

A lazy loader for Documents.

load()

Load data into Document objects.

load_and_split([text_splitter])

Load Documents and split into chunks.

send_pdf()

wait_for_processing(pdf_id)

Wait for processing to complete.

__init__(file_path: str, processed_file_format: str = 'md', max_wait_time_seconds: int = 500, should_clean_pdf: bool = False, extra_request_data: Dict[str, Any] | None = None, **kwargs: Any) → None

Initialize with a file path.

Parameters:
  • file_path (str) – path to the file to load.

  • processed_file_format (str) – format of the processed file. Default is “md”.

  • max_wait_time_seconds (int) – maximum time, in seconds, to wait for a response from the server. Default is 500.

  • should_clean_pdf (bool) – whether to clean the PDF file. Default is False.

  • extra_request_data (Dict[str, Any] | None) – additional request data.

  • **kwargs (Any) – additional keyword arguments.

Return type:

None

async alazy_load() → AsyncIterator[Document]

A lazy loader for Documents.

Return type:

AsyncIterator[Document]

async aload() → List[Document]

Load data into Document objects.

Return type:

List[Document]
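
The two async methods above can be consumed as sketched below. `FakeLoader` and `FakeDocument` are hypothetical stand-ins that mimic only the documented method names and the `page_content` attribute of a Document, so the example runs without network access; a real `MathpixPDFLoader` instance could be passed to the same helper functions.

```python
import asyncio
from typing import Any, AsyncIterator, List


class FakeDocument:
    """Hypothetical stand-in for a Document with a page_content attribute."""

    def __init__(self, page_content: str) -> None:
        self.page_content = page_content


class FakeLoader:
    """Hypothetical stand-in exposing alazy_load() and aload()."""

    def __init__(self, texts: List[str]) -> None:
        self._texts = texts

    async def alazy_load(self) -> AsyncIterator[FakeDocument]:
        for text in self._texts:
            yield FakeDocument(text)

    async def aload(self) -> List[FakeDocument]:
        return [doc async for doc in self.alazy_load()]


async def total_characters(loader: Any) -> int:
    """Stream documents one at a time and sum their lengths."""
    total = 0
    async for doc in loader.alazy_load():
        total += len(doc.page_content)
    return total


async def count_documents(loader: Any) -> int:
    """Materialize all documents at once and count them."""
    docs = await loader.aload()
    return len(docs)
```

From synchronous code, either helper can be driven with `asyncio.run(total_characters(loader))`.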

clean_pdf(contents: str) → str

Clean the PDF file.

Parameters:

contents (str) – the PDF file contents.

Return type:

str

Returns:

The cleaned file contents.
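
This page does not describe what cleaning involves. As an illustration only, the sketch below shows the kind of post-processing one might apply to Markdown produced by an OCR service: dropping image-only lines and removing backslash escapes. `clean_markdown` is a hypothetical function, not the library's implementation.

```python
def clean_markdown(contents: str) -> str:
    """Illustrative cleanup of OCR-generated Markdown (assumed rules)."""
    # Drop lines that consist of an image reference with empty alt text.
    kept = [line for line in contents.split("\n") if not line.startswith("![]")]
    cleaned = "\n".join(kept)
    # Remove backslash escapes in front of a few common characters.
    for escaped, plain in ((r"\$", "$"), (r"\%", "%"), (r"\(", "("), (r"\)", ")")):
        cleaned = cleaned.replace(escaped, plain)
    return cleaned
```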

get_processed_pdf(pdf_id: str) → str

Parameters:

pdf_id (str) – a PDF id.

Return type:

str
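
This page does not state how `send_pdf`, `wait_for_processing`, and `get_processed_pdf` fit together. A plausible order, assuming `send_pdf` returns the id that the other two methods accept, is sketched below. `fetch_processed_contents` and `FakeMathpixLoader` are hypothetical names; the fake records calls so the sequence can be checked offline.

```python
from typing import Any, List


def fetch_processed_contents(loader: Any) -> str:
    """Upload, wait, then download, using the documented method names."""
    pdf_id = loader.send_pdf()  # returns a str id
    loader.wait_for_processing(pdf_id)  # returns None once done
    return loader.get_processed_pdf(pdf_id)  # returns the processed text


class FakeMathpixLoader:
    """Hypothetical offline stand-in that records the call order."""

    def __init__(self) -> None:
        self.calls: List[str] = []

    def send_pdf(self) -> str:
        self.calls.append("send_pdf")
        return "pdf-123"

    def wait_for_processing(self, pdf_id: str) -> None:
        self.calls.append(f"wait_for_processing:{pdf_id}")

    def get_processed_pdf(self, pdf_id: str) -> str:
        self.calls.append(f"get_processed_pdf:{pdf_id}")
        return "# Title"
```

In everyday use, `load()` is the simpler entry point; the explicit sequence is mainly useful when the intermediate id or raw text is needed.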

lazy_load() → Iterator[Document]

A lazy loader for Documents.

Return type:

Iterator[Document]
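
Because `lazy_load()` returns an iterator, a caller can stop early without materializing every Document. The sketch below illustrates this with hypothetical stand-ins (`FakeDocument`, `FakeLoader`) that mimic only the documented method names and the `page_content` attribute; `first_matching` is likewise a hypothetical helper.

```python
from typing import Any, Iterator, List, Optional


class FakeDocument:
    """Hypothetical stand-in for a Document with a page_content attribute."""

    def __init__(self, page_content: str) -> None:
        self.page_content = page_content


class FakeLoader:
    """Hypothetical stand-in that counts how many documents it produced."""

    def __init__(self, texts: List[str]) -> None:
        self._texts = texts
        self.produced = 0

    def lazy_load(self) -> Iterator[FakeDocument]:
        for text in self._texts:
            self.produced += 1
            yield FakeDocument(text)

    def load(self) -> List[FakeDocument]:
        return list(self.lazy_load())


def first_matching(loader: Any, keyword: str) -> Optional[Any]:
    """Return the first document containing keyword, or None."""
    for doc in loader.lazy_load():
        if keyword in doc.page_content:
            return doc
    return None
```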

load() → List[Document]

Load data into Document objects.

Return type:

List[Document]

load_and_split(text_splitter: TextSplitter | None = None) → List[Document]

Load Documents and split into chunks. Chunks are returned as Documents.

Do not override this method. It should be considered deprecated.

Parameters:

text_splitter (Optional[TextSplitter]) – TextSplitter instance to use for splitting documents. Defaults to RecursiveCharacterTextSplitter.

Returns:

List of Documents.

Return type:

List[Document]
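
Since `load_and_split` is described as deprecated, an explicit two-step equivalent may be preferable. The sketch below assumes the splitter exposes a `split_documents` method, as LangChain `TextSplitter` objects do. `load_then_split`, `FixedWidthSplitter`, `FakeDocument`, and `FakeLoader` are hypothetical names used so the example runs offline.

```python
from typing import Any, List


class FakeDocument:
    """Hypothetical stand-in for a Document with a page_content attribute."""

    def __init__(self, page_content: str) -> None:
        self.page_content = page_content


class FakeLoader:
    """Hypothetical stand-in exposing load()."""

    def __init__(self, texts: List[str]) -> None:
        self._texts = texts

    def load(self) -> List[FakeDocument]:
        return [FakeDocument(text) for text in self._texts]


class FixedWidthSplitter:
    """Toy splitter: cuts each document into fixed-width chunks."""

    def __init__(self, width: int) -> None:
        self.width = width

    def split_documents(self, docs: List[FakeDocument]) -> List[FakeDocument]:
        chunks = []
        for doc in docs:
            text = doc.page_content
            for start in range(0, len(text), self.width):
                chunks.append(FakeDocument(text[start:start + self.width]))
        return chunks


def load_then_split(loader: Any, text_splitter: Any) -> List[Any]:
    """Load all documents, then split them into chunks."""
    docs = loader.load()
    return text_splitter.split_documents(docs)
```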

send_pdf() → str

Return type:

str

wait_for_processing(pdf_id: str) → None

Wait for processing to complete.

Parameters:

pdf_id (str) – a PDF id.

Return type:

None

Returns: None
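
Waiting for a remote job typically means polling until it finishes or a time budget runs out. The sketch below is a generic illustration of that pattern, bounded by a value like `max_wait_time_seconds`; it is not the library's implementation. The status strings (`"completed"`, `"error"`), the poll interval, and the name `wait_until_done` are all assumptions.

```python
import time
from typing import Callable


def wait_until_done(
    check_status: Callable[[], str],
    max_wait_time_seconds: float = 500,
    poll_interval_seconds: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Poll check_status until it reports completion or the budget is spent."""
    waited = 0.0
    while waited < max_wait_time_seconds:
        status = check_status()
        if status == "completed":  # assumed status value
            return
        if status == "error":  # assumed status value
            raise ValueError("Processing failed")
        sleep(poll_interval_seconds)
        waited += poll_interval_seconds
    raise TimeoutError(
        f"Processing did not finish within {max_wait_time_seconds} seconds"
    )
```

Passing `sleep` as a parameter keeps the function testable without real delays.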
