Document AI Utilities
zyx provides a couple of utilities that make working with documents and long text a little easier.
Chunking
Use the chunk function for quick semantic chunking, with optional parallelization.
API Reference
Takes a string, Document, or a list of strings/Document models and returns the chunked content.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
inputs | Union[str, Document, List[Union[str, Document]]] | The input to chunk. | required |
chunk_size | int | The size of the chunks to return. | 512 |
model | str | The model to use for chunking. | 'gpt-4' |
processes | int | The number of processes to use for chunking. | 1 |
memoize | bool | Whether to memoize the chunking process. | True |
progress | bool | Whether to show a progress bar. | False |
max_token_chars | int | The maximum number of characters to use for chunking. | None |
Returns:
Type | Description |
---|---|
Union[List[str], List[List[str]]] | The chunked content. |
Source code in zyx/resources/data/chunk.py
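To make the input and return types in the reference above concrete, here is a minimal sketch. It does not call zyx: `naive_chunk` is a hypothetical stand-in that groups whitespace-separated words, whereas the real chunk function is described as semantic and takes a model argument. The mapping shown (a single string gives a flat list, a list of strings gives a list of lists) is an assumption read off the `Union` types, not confirmed behavior.

```python
from typing import List, Union


def naive_chunk(
    inputs: Union[str, List[str]], chunk_size: int = 512
) -> Union[List[str], List[List[str]]]:
    """Hypothetical stand-in for zyx's chunk: groups whitespace-separated words."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")

    def split_one(text: str) -> List[str]:
        # Join every run of `chunk_size` words back into one chunk string.
        words = text.split()
        return [
            " ".join(words[i : i + chunk_size])
            for i in range(0, len(words), chunk_size)
        ]

    # A single string yields List[str]; a list yields List[List[str]].
    if isinstance(inputs, str):
        return split_one(inputs)
    return [split_one(text) for text in inputs]


single = naive_chunk("one two three four five", chunk_size=2)
batch = naive_chunk(["one two three", "four five"], chunk_size=2)
print(single)  # ['one two', 'three four', 'five']
print(batch)   # [['one two', 'three'], ['four five']]
```

Here `chunk_size` counts words purely for illustration; the unit zyx uses for `chunk_size` is not stated in the reference.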
Reading
Use the read function to quickly read most document types from both local file systems and the web.
It can ingest many documents at once and return a list of Document models.
from zyx import read
read("path/to/file.pdf")
# Document(content="...", metadata={"file_name": "file.pdf", "file_type": "application/pdf", "file_size": 123456})
API Reference
Reads either a file, a directory, or a list of files and returns the content.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
path | Union[str, Path, List[Union[str, Path]]] | The path to read. | required |
output | Union[Type[str], OutputFormat] | The output format. | 'document' |
target | OutputType | The output type. | 'text' |
verbose | bool | Whether to print verbose output. | False |
workers | Optional[int] | The number of workers to use for reading. | None |
Returns:
Type | Description |
---|---|
Union[Document, List[Document], str, Dict, List[Dict]] | The content. |
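The reference above says read accepts a file, a directory, or a list of files, and can return either one Document or a list of them. The sketch below mimics only that shape for plain-text files, using the standard library. `SimpleDocument` and `read_text_files` are hypothetical stand-ins, not zyx's classes or implementation; the metadata keys follow the example output shown earlier (`file_name`, `file_type`, `file_size`).

```python
import mimetypes
import tempfile
from dataclasses import dataclass, field
from pathlib import Path
from typing import Dict, List, Union


@dataclass
class SimpleDocument:
    """Hypothetical stand-in for zyx's Document model."""
    content: str
    metadata: Dict[str, object] = field(default_factory=dict)


def _read_one(p: Path) -> SimpleDocument:
    # Guess the MIME type from the file name only; None if unknown.
    file_type, _ = mimetypes.guess_type(p.name)
    return SimpleDocument(
        content=p.read_text(encoding="utf-8"),
        metadata={
            "file_name": p.name,
            "file_type": file_type,
            "file_size": p.stat().st_size,
        },
    )


def read_text_files(
    path: Union[str, Path, List[Union[str, Path]]]
) -> Union[SimpleDocument, List[SimpleDocument]]:
    """Read one UTF-8 text file, every file in a directory, or a list of files."""
    if isinstance(path, (str, Path)):
        p = Path(path)
        if p.is_dir():
            return [_read_one(f) for f in sorted(p.iterdir()) if f.is_file()]
        return _read_one(p)
    return [_read_one(Path(p)) for p in path]


# Demonstrate both shapes against a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    note = Path(tmp) / "note.txt"
    note.write_text("hello", encoding="utf-8")
    doc = read_text_files(note)   # single path -> one document
    docs = read_text_files(tmp)   # directory -> list of documents
```

The real read function also handles non-text formats such as PDF and web sources, which this sketch deliberately leaves out.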