UnstructuredLoader

class langchain_unstructured.document_loaders.UnstructuredLoader(file_path: str | Path | list[str] | list[pathlib.Path] | None = None, *, file: IO[bytes] | list[IO[bytes]] | None = None, partition_via_api: bool = False, post_processors: list[Callable[[str], str]] | None = None, api_key: str | None = None, client: UnstructuredClient | None = None, url: str | None = None, **kwargs: Any)

Unstructured document loader interface.

Partition and load files either through the Unstructured API (using the unstructured-client SDK) or locally with the unstructured library.

API: This package is configured to work with the Unstructured API by default. To use the Unstructured API, set partition_via_api=True and define api_key. If you are running the Unstructured API locally, you can change the API URL by defining url when you initialize the loader. The hosted Unstructured API requires an API key. See the links below to learn more about our API offerings and get an API key.
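As a minimal sketch of the API configuration described above, the helper below gathers the keyword arguments for API partitioning. Only partition_via_api, api_key, and url come from the signature on this page. The UNSTRUCTURED_API_KEY environment variable name and both helper names are assumptions made for illustration.

```python
import os
from typing import Any, Optional


def api_loader_kwargs(
    api_key: Optional[str] = None, url: Optional[str] = None
) -> dict[str, Any]:
    """Collect the keyword arguments that switch the loader to API partitioning."""
    # Assumption: fall back to an environment variable when no key is passed.
    key = api_key or os.environ.get("UNSTRUCTURED_API_KEY")
    if url is None and not key:
        # No custom URL means the hosted API, which requires a key.
        raise ValueError("the hosted Unstructured API requires an API key")
    kwargs: dict[str, Any] = {"partition_via_api": True}
    if key:
        kwargs["api_key"] = key
    if url is not None:
        # Only needed when targeting a locally running API instance.
        kwargs["url"] = url
    return kwargs


def make_api_loader(file_path: str, api_key: Optional[str] = None, url: Optional[str] = None):
    """Hypothetical wrapper; the import is deferred so the helper above
    can be used and tested without the package installed."""
    from langchain_unstructured.document_loaders import UnstructuredLoader

    return UnstructuredLoader(file_path, **api_loader_kwargs(api_key, url))
```

Keeping the keyword arguments in a separate function makes the configuration easy to check before any network call is made.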

Local: To partition files locally, you must have the unstructured package installed. You can install it with pip install unstructured. By default the file loader uses the Unstructured partition function and will automatically detect the file type.

In addition to document-specific partition parameters, Unstructured has a rich set of "chunking" parameters for post-processing elements into more useful text segments for use cases such as Retrieval Augmented Generation (RAG). You can pass additional Unstructured kwargs to the loader to configure different unstructured settings.
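A hedged sketch of passing chunking settings through **kwargs follows. The parameter names chunking_strategy, max_characters, and overlap are taken from the Unstructured chunking documentation linked under References and are assumed to be forwarded unchanged to the partition call. The helper names and the default values are hypothetical.

```python
from typing import Any

# Assumed defaults for illustration only; tune them for your documents.
DEFAULT_CHUNKING: dict[str, Any] = {
    "chunking_strategy": "by_title",
    "max_characters": 1000,
    "overlap": 100,
}


def chunking_kwargs(**overrides: Any) -> dict[str, Any]:
    """Merge caller overrides on top of the illustrative defaults."""
    return {**DEFAULT_CHUNKING, **overrides}


def make_chunking_loader(file_path: str, **overrides: Any):
    """Hypothetical wrapper; the import is deferred so the helpers above
    work without the package installed."""
    from langchain_unstructured.document_loaders import UnstructuredLoader

    # Extra keyword arguments travel through the loader's **kwargs.
    return UnstructuredLoader(file_path, **chunking_kwargs(**overrides))
```

For example, `chunking_kwargs(max_characters=500)` keeps the assumed strategy and overlap while shrinking the chunk size.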

Setup:
Instantiate:
Load:
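A minimal sketch of these three steps, assuming local partitioning with the unstructured package installed. The file name passed by the caller, and the helpers load_locally and preview, are hypothetical and not part of the library.

```python
# Setup (shell): pip install langchain-unstructured unstructured
from typing import Any, Iterable


def load_locally(file_path: str) -> list:
    """Instantiate the loader with local partitioning and load every Document."""
    # Deferred import so that preview() below works without the package.
    from langchain_unstructured.document_loaders import UnstructuredLoader

    loader = UnstructuredLoader(file_path)  # partition_via_api defaults to False
    return loader.load()


def preview(docs: Iterable[Any], width: int = 100) -> list[str]:
    """Return the first `width` characters of each document's page_content."""
    return [doc.page_content[:width] for doc in docs]
```

With a file on disk, `preview(load_locally("example.txt"))` would return a short excerpt of each loaded Document.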

References

https://docs.unstructured.io/api-reference/api-services/sdk
https://docs.unstructured.io/api-reference/api-services/overview
https://docs.unstructured.io/open-source/core-functionality/partitioning
https://docs.unstructured.io/open-source/core-functionality/chunking

Initialize loader.

Methods

__init__([file_path, file, ...])

Initialize loader.

alazy_load()

A lazy loader for Documents.

aload()

Load data into Document objects.

lazy_load()

Load file(s) to the _UnstructuredBaseLoader.

load()

Load data into Document objects.

load_and_split([text_splitter])

Load Documents and split into chunks.

Parameters:
  • file_path (Optional[str | Path | list[str] | list[Path]]) –

  • file (Optional[IO[bytes] | list[IO[bytes]]]) –

  • partition_via_api (bool) –

  • post_processors (Optional[list[Callable[[str], str]]]) –

  • api_key (Optional[str]) –

  • client (Optional[UnstructuredClient]) –

  • url (Optional[str]) –

  • kwargs (Any) –

__init__(file_path: str | Path | list[str] | list[pathlib.Path] | None = None, *, file: IO[bytes] | list[IO[bytes]] | None = None, partition_via_api: bool = False, post_processors: list[Callable[[str], str]] | None = None, api_key: str | None = None, client: UnstructuredClient | None = None, url: str | None = None, **kwargs: Any)

Initialize loader.

Parameters:
  • file_path (str | Path | list[str] | list[pathlib.Path] | None) –

  • file (IO[bytes] | list[IO[bytes]] | None) –

  • partition_via_api (bool) –

  • post_processors (list[Callable[[str], str]] | None) –

  • api_key (str | None) –

  • client (UnstructuredClient | None) –

  • url (str | None) –

  • kwargs (Any) –
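Each entry in post_processors is a plain function from str to str. The sketch below defines two such functions and a helper that applies them in sequence. That the loader applies them in list order to each element's text is an assumption, and every name in the block is hypothetical.

```python
import re
from typing import Callable


def collapse_whitespace(text: str) -> str:
    """Replace every run of whitespace with a single space and trim the ends."""
    return re.sub(r"\s+", " ", text).strip()


def strip_page_markers(text: str) -> str:
    """Drop markers such as 'Page 3 of 10' (an illustrative pattern)."""
    return re.sub(r"Page \d+ of \d+", "", text)


# Order matters: remove markers first, then tidy the leftover whitespace.
POST_PROCESSORS: list[Callable[[str], str]] = [strip_page_markers, collapse_whitespace]


def apply_all(text: str, processors: list[Callable[[str], str]]) -> str:
    """Apply each processor in turn, feeding the result into the next one."""
    for process in processors:
        text = process(text)
    return text
```

The list could then be passed as `UnstructuredLoader(file_path, post_processors=POST_PROCESSORS)`.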

async alazy_load() → AsyncIterator[Document]

A lazy loader for Documents.

Return type:

AsyncIterator[Document]
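A small sketch of consuming alazy_load() with async for. collect_first and first_documents are hypothetical helpers. collect_first accepts any async iterator, so it can be exercised without the loader.

```python
import asyncio
from typing import AsyncIterator, TypeVar

T = TypeVar("T")


async def collect_first(items: AsyncIterator[T], limit: int) -> list[T]:
    """Gather at most `limit` items, then stop consuming the iterator."""
    collected: list[T] = []
    if limit <= 0:
        return collected
    async for item in items:
        collected.append(item)
        if len(collected) >= limit:
            break
    return collected


async def first_documents(file_path: str, limit: int = 5) -> list:
    """Hypothetical wrapper that reads the first few Documents asynchronously."""
    # Deferred import so collect_first() works without the package.
    from langchain_unstructured.document_loaders import UnstructuredLoader

    loader = UnstructuredLoader(file_path)
    return await collect_first(loader.alazy_load(), limit)
```

From synchronous code, the wrapper would be driven with `asyncio.run(first_documents(path))`.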

async aload() → List[Document]

Load data into Document objects.

Return type:

List[Document]

lazy_load() → Iterator[Document]

Load file(s) to the _UnstructuredBaseLoader.

Return type:

Iterator[Document]
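Because lazy_load() returns an iterator, Documents can be handled a few at a time instead of being collected into one list. The sketch below groups any iterable into fixed-size batches. in_batches and count_documents are hypothetical helpers, and the batch size is an arbitrary choice.

```python
from itertools import islice
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")


def in_batches(items: Iterable[T], size: int) -> Iterator[list[T]]:
    """Yield consecutive lists of up to `size` items without reading ahead."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def count_documents(file_path: str, size: int = 50) -> int:
    """Hypothetical wrapper that walks the loader's output batch by batch."""
    # Deferred import so in_batches() works without the package.
    from langchain_unstructured.document_loaders import UnstructuredLoader

    total = 0
    for batch in in_batches(UnstructuredLoader(file_path).lazy_load(), size):
        total += len(batch)  # replace with a call to your own processing code
    return total
```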

load() → List[Document]

Load data into Document objects.

Return type:

List[Document]

load_and_split(text_splitter: TextSplitter | None = None) → List[Document]

Load Documents and split into chunks. Chunks are returned as Documents.

Do not override this method. It should be considered deprecated.

Parameters:

text_splitter (Optional[TextSplitter]) – TextSplitter instance to use for splitting documents. Defaults to RecursiveCharacterTextSplitter.

Returns:

List of Documents.

Return type:

List[Document]