DocAIParser
- class langchain_google_community.docai.DocAIParser(*, client: DocumentProcessorServiceClient | None = None, location: str | None = None, gcs_output_path: str | None = None, processor_name: str | None = None)
Google Cloud Document AI parser.
For a detailed explanation of Document AI, refer to the product documentation: https://cloud.google.com/document-ai/docs/overview
Initializes the parser.
- Parameters:
client (DocumentProcessorServiceClient | None) – a DocumentProcessorServiceClient to use.
location (str | None) – the Google Cloud location where the Document AI processor is located.
gcs_output_path (str | None) – a path on Google Cloud Storage to store parsing results.
processor_name (str | None) – the full resource name of a Document AI processor or processor version.
Provide either a client or a location. If only a location is given, a client is instantiated for it.
Methods
__init__(*[, client, location, ...]) – Initializes the parser.
batch_parse(blobs[, gcs_output_path, ...]) – Parses a list of blobs lazily.
docai_parse(blobs, *[, gcs_output_path, ...]) – Runs Google Document AI PDF Batch Processing on a list of blobs.
get_results(operations)
is_running(operations)
lazy_parse(blob) – Parses a blob lazily.
online_process(blob[, ...]) – Parses a blob lazily using online processing.
operations_from_names(operation_names) – Initializes Long-Running Operations from their names.
parse(blob) – Eagerly parse the blob into a document or documents.
parse_from_results(results)
- __init__(*, client: DocumentProcessorServiceClient | None = None, location: str | None = None, gcs_output_path: str | None = None, processor_name: str | None = None)
Initializes the parser.
- Parameters:
client (DocumentProcessorServiceClient | None) – a DocumentProcessorServiceClient to use.
location (str | None) – the Google Cloud location where the Document AI processor is located.
gcs_output_path (str | None) – a path on Google Cloud Storage to store parsing results.
processor_name (str | None) – the full resource name of a Document AI processor or processor version.
Provide either a client or a location. If only a location is given, a client is instantiated for it.
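A minimal sketch of constructing the parser from a location rather than a client. It assumes the usual Document AI resource name layout (`projects/.../locations/.../processors/...`). `build_processor_name` and `make_parser` are hypothetical helpers for illustration, not part of the library, and all argument values are placeholders.

```python
def build_processor_name(project: str, location: str, processor_id: str) -> str:
    """Compose a full processor resource name (hypothetical helper)."""
    return f"projects/{project}/locations/{location}/processors/{processor_id}"


def make_parser(project: str, location: str, processor_id: str, gcs_output_path: str):
    """Create a DocAIParser from a location, so the client is instantiated for us."""
    # Imported lazily so this module loads even without the Google packages installed.
    from langchain_google_community.docai import DocAIParser

    return DocAIParser(
        location=location,
        processor_name=build_processor_name(project, location, processor_id),
        gcs_output_path=gcs_output_path,
    )
```

A call might look like `make_parser("my-project", "us", "abc123", "gs://my-bucket/docai-output/")`, where every value is a stand-in for your own project settings.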
- batch_parse(blobs: Sequence[Blob], gcs_output_path: str | None = None, timeout_sec: int = 3600, check_in_interval_sec: int = 60) → Iterator[Document]
Parses a list of blobs lazily.
- Parameters:
blobs (Sequence[Blob]) – a list of blobs to parse.
gcs_output_path (str | None) – a path on Google Cloud Storage to store parsing results.
timeout_sec (int) – how long to wait for Document AI to complete, in seconds.
check_in_interval_sec (int) – the interval between checks on whether the parsing operations have completed, in seconds.
- Return type:
Iterator[Document]
This is a long-running operation. The recommended approach is to decouple parsing from creating LangChain Documents:
>>> operations = parser.docai_parse(blobs, gcs_output_path=gcs_path)
>>> parser.is_running(operations)
You can get the operation names and save them:
>>> names = [op.operation.name for op in operations]
When all operations have finished, you can use their results:
>>> operations = parser.operations_from_names(names)
>>> results = parser.get_results(operations)
>>> docs = parser.parse_from_results(results)
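The waiting step between starting the operations and reading their results can be sketched as a polling loop. This is an illustration under stated assumptions, not the library's implementation: `wait_and_collect` is a hypothetical helper, and `parser` is any object exposing the `is_running`, `get_results`, and `parse_from_results` methods documented here.

```python
import time
from typing import Any, List


def wait_and_collect(
    parser: Any,
    operations: List[Any],
    timeout_sec: int = 3600,
    check_in_interval_sec: int = 60,
) -> List[Any]:
    """Poll until the operations finish, then build LangChain Documents."""
    deadline = time.monotonic() + timeout_sec
    while parser.is_running(operations):
        # Give up once the overall timeout has elapsed.
        if time.monotonic() >= deadline:
            raise TimeoutError(f"operations still running after {timeout_sec} seconds")
        time.sleep(check_in_interval_sec)
    results = parser.get_results(operations)
    return list(parser.parse_from_results(results))
```

The defaults mirror the `timeout_sec` and `check_in_interval_sec` defaults of `batch_parse`; a shorter interval gives faster feedback at the cost of more status checks.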
- docai_parse(blobs: Sequence[Blob], *, gcs_output_path: str | None = None, processor_name: str | None = None, batch_size: int = 1000, enable_native_pdf_parsing: bool = True, field_mask: str | None = None) → List[Operation]
Runs Google Document AI PDF Batch Processing on a list of blobs.
- Parameters:
blobs (Sequence[Blob]) – a list of blobs to be parsed.
gcs_output_path (str | None) – a path (folder) on GCS to store results.
processor_name (str | None) – the name of a Document AI processor.
batch_size (int) – the number of documents per batch.
enable_native_pdf_parsing (bool) – whether to enable extraction of text embedded in PDFs.
field_mask (str | None) – a comma-separated list of the fields to include in the Document AI response. Suggested: "text,pages.pageNumber,pages.layout"
- Return type:
List[Operation]
Document AI has a limit of 1000 files per batch, so larger inputs need to be split into multiple requests. Batch processing is an asynchronous long-running operation, and results are stored in an output GCS bucket.
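The splitting described above can be illustrated with a small chunking helper. This is only a sketch of the idea: `split_into_batches` is a hypothetical function, not the library's code, and since `docai_parse` already accepts a `batch_size` argument you would presumably not need to split the input yourself.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def split_into_batches(items: Sequence[T], batch_size: int = 1000) -> List[Sequence[T]]:
    """Split items into consecutive chunks of at most batch_size elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [items[i : i + batch_size] for i in range(0, len(items), batch_size)]
```

With 2500 inputs and the default batch size, this yields three batches of 1000, 1000, and 500 items, which would correspond to three separate batch requests.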
- get_results(operations: List[Operation]) → List[DocAIParsingResults]
- Parameters:
operations (List[Operation])
- Return type:
List[DocAIParsingResults]
- is_running(operations: List[Operation]) → bool
- Parameters:
operations (List[Operation])
- Return type:
bool
- lazy_parse(blob: Blob) → Iterator[Document]
Parses a blob lazily.
This is a long-running operation. The recommended approach is to batch documents together and use the batch_parse() method.
- online_process(blob: Blob, enable_native_pdf_parsing: bool = True, field_mask: str | None = None, page_range: List[int] | None = None) → Iterator[Document]
Parses a blob lazily using online processing.
- Parameters:
blob (Blob) – a blob to parse.
enable_native_pdf_parsing (bool) – whether to enable extraction of text embedded in PDFs.
field_mask (str | None) – a comma-separated list of the fields to include in the Document AI response. Suggested: "text,pages.pageNumber,pages.layout"
page_range (List[int] | None) – a list of page numbers to parse. If None, the entire document is parsed.
- Return type:
Iterator[Document]
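A short sketch of calling `online_process` with the suggested field mask and an explicit page range. `make_field_mask` and `parse_pages` are hypothetical helpers, and `parser` is assumed to be a DocAIParser (or any object with a compatible `online_process` method).

```python
from typing import Any, Iterable, List, Optional

# The field selection suggested in the parameter description above.
SUGGESTED_FIELDS = ("text", "pages.pageNumber", "pages.layout")


def make_field_mask(fields: Iterable[str] = SUGGESTED_FIELDS) -> str:
    """Join field paths into the comma-separated form expected by field_mask."""
    return ",".join(fields)


def parse_pages(parser: Any, blob: Any, pages: Optional[List[int]] = None) -> List[Any]:
    """Run online processing on selected pages and collect the documents."""
    # pages=None asks for the entire document, per the page_range description.
    return list(
        parser.online_process(blob, field_mask=make_field_mask(), page_range=pages)
    )
```

Restricting the field mask keeps the response small; pass additional field paths to `make_field_mask` if you need more of the Document AI output.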
- operations_from_names(operation_names: List[str]) → List[Operation]
Initializes Long-Running Operations from their names.
- Parameters:
operation_names (List[str])
- Return type:
List[Operation]
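Because operations can be rebuilt from their names, the names can be persisted and the work resumed later, possibly in another process. The sketch below stores them as JSON. `save_operation_names` and `resume_operations` are hypothetical helpers, and the `op.operation.name` attribute path follows the example given for `batch_parse`.

```python
import json
from pathlib import Path
from typing import Any, List


def save_operation_names(operations: List[Any], path: Path) -> List[str]:
    """Write the names of the running operations to a JSON file."""
    names = [op.operation.name for op in operations]
    path.write_text(json.dumps(names))
    return names


def resume_operations(parser: Any, path: Path) -> List[Any]:
    """Rebuild the operations from the names saved earlier."""
    names = json.loads(path.read_text())
    return parser.operations_from_names(names)
```

The restored operations can then be passed to `is_running` and `get_results` as usual.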
- parse(blob: Blob) → List[Document]
Eagerly parse the blob into a document or documents.
This is a convenience method for interactive development environments.
Production applications should favor the lazy_parse method instead.
Subclasses should generally not override this parse method.
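The relationship between the eager and lazy methods can be illustrated with a toy class. This sketch assumes that `parse` simply materializes the output of `lazy_parse` into a list. `ToyParser` is a stand-in written for this example, not a LangChain class, and it parses plain strings rather than blobs.

```python
from typing import Iterator, List


class ToyParser:
    """Stand-in parser that yields one item per line of text."""

    def lazy_parse(self, blob: str) -> Iterator[str]:
        # Items are produced one at a time, only as the caller iterates.
        for line in blob.splitlines():
            yield line.strip()

    def parse(self, blob: str) -> List[str]:
        # Eager parsing materializes the whole lazy iterator in memory.
        return list(self.lazy_parse(blob))
```

A subclass that overrides only `lazy_parse` gets a consistent `parse` for free, which is one reason to leave `parse` alone.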
- parse_from_results(results: List[DocAIParsingResults]) → Iterator[Document]
- Parameters:
results (List[DocAIParsingResults])
- Return type:
Iterator[Document]