ExperimentalMarkdownSyntaxTextSplitter

class langchain_text_splitters.markdown.ExperimentalMarkdownSyntaxTextSplitter(headers_to_split_on: List[Tuple[str, str]] | None = None, return_each_line: bool = False, strip_headers: bool = True)

An experimental text splitter for handling Markdown syntax.

This splitter aims to retain the exact whitespace of the original text while extracting structured metadata, such as headers. It is a re-implementation of the MarkdownHeaderTextSplitter with notable changes to the approach and additional features.

Key Features:

  • Retains the original whitespace and formatting of the Markdown text.
  • Extracts headers, code blocks, and horizontal rules as metadata.
  • Splits out code blocks and includes the language in the “Code” metadata key.
  • Splits text on horizontal rules (---) as well.
  • Defaults to sensible splitting behavior, which can be overridden using the headers_to_split_on parameter.
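The splitting rules listed above can be illustrated with a small standalone sketch. It is not the library's implementation: sketch_split, the regular expressions, and the (content, metadata) tuples are hypothetical stand-ins chosen for illustration, and the real splitter may differ in edge cases.

```python
import re
from typing import Dict, List, Optional, Tuple

# Hypothetical names for illustration only; not part of langchain_text_splitters.
FENCE = "`" * 3  # a Markdown code fence, built here to avoid a literal fence
HEADER_RE = re.compile(r"^(#{1,6}) (.*)")
FENCE_RE = re.compile(r"^`{3}(.*)")
RULE_RE = re.compile(r"^(---+|\*\*\*+|___+)\s*$")

Chunk = Tuple[str, Dict[str, str]]  # (content, metadata)


def sketch_split(text: str, header_keys: Dict[str, str]) -> List[Chunk]:
    chunks: List[Chunk] = []
    headers: Dict[int, str] = {}  # header depth -> current header text
    buffer = ""

    def flush(extra: Optional[Dict[str, str]] = None) -> None:
        """Emit the buffered text as a chunk unless it is only whitespace."""
        nonlocal buffer
        if buffer.strip():
            meta = {header_keys["#" * d]: t for d, t in sorted(headers.items())}
            meta.update(extra or {})
            chunks.append((buffer, meta))
        buffer = ""

    lines = text.splitlines(keepends=True)  # keepends preserves whitespace
    i = 0
    while i < len(lines):
        line = lines[i]
        header = HEADER_RE.match(line)
        fence = FENCE_RE.match(line)
        if header and header.group(1) in header_keys:
            flush()
            depth = len(header.group(1))
            # A new header replaces any header of equal or greater depth.
            headers = {d: t for d, t in headers.items() if d < depth}
            headers[depth] = header.group(2).strip()
        elif fence:
            flush()
            buffer = line
            i += 1
            # Consume lines up to and including the closing fence.
            while i < len(lines):
                buffer += lines[i]
                if FENCE_RE.match(lines[i]):
                    break
                i += 1
            flush({"Code": fence.group(1).strip()})
        elif RULE_RE.match(line):
            flush()  # a horizontal rule ends the current chunk
        else:
            buffer += line
        i += 1
    flush()
    return chunks


sample = (
    "# Title\n\nIntro text.\n\n## Usage\n\n"
    + FENCE + "python\nprint('hi')\n" + FENCE + "\n\n"
    + "Before rule.\n\n---\n\nAfter rule.\n"
)
chunks = sketch_split(sample, {"#": "Header 1", "##": "Header 2"})
for content, metadata in chunks:
    print(metadata, repr(content))
```

In this sketch the header lines themselves are dropped from the content and survive only as metadata, while every other line, including blank ones, is copied through unchanged.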

Parameters:

headers_to_split_on : List[Tuple[str, str]], optional

Headers to split on, defaulting to common Markdown headers if not specified.

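return_each_line : bool, optional

When set to True, returns each line as a separate chunk. Default is False.

strip_headers : bool, optional

When set to True, header lines are removed from the chunk content and kept only in the metadata. Default is True.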
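As a rough illustration of what return_each_line implies, the sketch below expands chunks into one chunk per line. It is only a sketch: expand_to_lines and the (content, metadata) tuples are hypothetical stand-ins for the splitter and its Document objects, and dropping blank lines and line endings is an assumption rather than documented behavior.

```python
from typing import Dict, List, Tuple

# Hypothetical stand-in for a Document: (content, metadata).
Chunk = Tuple[str, Dict[str, str]]


def expand_to_lines(chunks: List[Chunk]) -> List[Chunk]:
    """Turn each chunk into one chunk per non-blank line, copying its metadata."""
    return [
        (line, dict(metadata))
        for content, metadata in chunks
        for line in content.splitlines()
        if line.strip()  # assumption: whitespace-only lines are skipped
    ]


chunks = [("First line.\n\nSecond line.\n", {"Header 1": "Title"})]
per_line = expand_to_lines(chunks)
for content, metadata in per_line:
    print(metadata, repr(content))
```

Each resulting line carries its own copy of the parent chunk's metadata, so header context is not lost when chunks are broken down this far.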

Usage example:

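>>> from langchain_text_splitters.markdown import (
...     ExperimentalMarkdownSyntaxTextSplitter,
... )
>>> # Sample input; any Markdown string works here.
>>> text = "# Title\n\nIntro text.\n\n## Section\n\nSection text.\n"
>>> headers_to_split_on = [
...     ("#", "Header 1"),
...     ("##", "Header 2"),
... ]
>>> splitter = ExperimentalMarkdownSyntaxTextSplitter(
...     headers_to_split_on=headers_to_split_on
... )
>>> chunks = splitter.split_text(text)
>>> for chunk in chunks:
...     print(chunk)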

This class is currently experimental and subject to change based on feedback and further development.

Attributes

DEFAULT_HEADER_KEYS

Methods

__init__([headers_to_split_on, ...])

split_text(text)

__init__(headers_to_split_on: List[Tuple[str, str]] | None = None, return_each_line: bool = False, strip_headers: bool = True)
Parameters:
  • headers_to_split_on (List[Tuple[str, str]] | None) –

  • return_each_line (bool) –

  • strip_headers (bool) –

split_text(text: str) → List[Document]
Parameters:

text (str) –

Return type:

List[Document]
