Use LLMs to eliminate manual processes involving unstructured data.
(see instructions)
- git
(see below)
- pyenv (recommended to manage multiple Python versions)
Just run the launch script to get started in a few minutes.
The launch script sets up the environment with default values, pulls public Docker images or builds them locally, and finally runs them in containers.
# Pull and run entire Unstract platform with default env config.
# Pull and run docker containers with a specific version tag.
./ -v v0.1.0
# Build docker images locally and run with a specific version tag.
./ -b -v v0.1.0
# Display the help information.
./ -h
# Only do setup of environment files.
./ -e
# Only pull docker images with a specific version tag.
./ -p -v v0.1.0
# Only build docker images locally (instead of pulling them) with a specific version tag.
./ -p -b -v v0.1.0
# Pull and run docker containers in detached mode.
./ -d -v v0.1.0
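The flag combinations above can also be assembled programmatically, for example from a wrapper or CI job. The following is a minimal sketch, not part of Unstract: `build_launch_args` is a hypothetical helper, and the script path is passed in by the caller because it is not fixed here. Only the flags listed above (`-e`, `-p`, `-b`, `-d`, `-v`) are used.

```python
import shlex
from typing import List, Optional


def build_launch_args(
    script: str,
    version: Optional[str] = None,
    build_local: bool = False,
    only_env: bool = False,
    only_pull: bool = False,
    detached: bool = False,
) -> List[str]:
    """Return an argv list for the launch script (hypothetical helper)."""
    args = [script]
    if only_env:
        args.append("-e")  # only set up environment files
    if only_pull:
        args.append("-p")  # only pull (or build) images, do not run them
    if build_local:
        args.append("-b")  # build images locally instead of pulling
    if detached:
        args.append("-d")  # run containers in detached mode
    if version is not None:
        args += ["-v", version]  # specific version tag
    return args


if __name__ == "__main__":
    # "./launch-script" is a placeholder path, not the real script name.
    argv = build_launch_args("./launch-script", version="v0.1.0", detached=True)
    print(shlex.join(argv))
```

The resulting list can be handed to `subprocess.run(argv, check=True)` once the placeholder is replaced with the actual script path.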
Now visit http://frontend.unstract.localhost in your browser.
That's all. Enjoy!
The default username is unstract and the default password is unstract.
More details on configuring this can be found in backend's
Unstract works predominantly with PDF documents and requires a Text Extractor to be configured in the application to retrieve text from the documents. Currently supported text extractors include:
- LLMWhisperer (works best)
- Unstructured Community
- Unstructured Enterprise
LLMWhisperer is our text extraction service, which provides the best results with Unstract.
- Create an account in the developer portal
- Create a
under your profile and copy the Primary Key
- Try the APIs from the portal by passing the copied key in the request header
- This key needs to be passed in our application while creating an LLM Whisperer Text Extractor
See Docker
- Install the libraries below, which are needed to run Unstract (apt on Debian/Ubuntu, brew on macOS)
apt install build-essential libmagic-dev pandoc pkg-config tesseract-ocr
brew install freetds libmagic pkg-config poppler
All commands assume that you have activated your venv.
cd <service>
# Create venv
pdm venv create -w virtualenv --with-pip
eval "$(pdm venv activate in-project)"
# Remove venv
pdm venv remove in-project
PDM is used for dependency management.
# Install via script
curl -sSL | python3 -
# Install via pip
pip install pdm
Go to the service directory and install the dependencies listed in the corresponding pyproject.toml.
# Install dependencies
pdm install
# Install specific dev dependency group
pdm install --dev -G lint
# Install production dependencies only
pdm install --prod --no-editable
PDM allows you to run scripts applicable within the service dir.
# List the possible scripts that can be executed
pdm run -l
Add dependencies as follows.
# Add a new service dependency to its pyproject.toml.
pdm add <package_from_PyPI>
# Add a relative path as an editable install.
pdm add -e <relative_path_to_local_package>
# List all dependencies.
pdm list
After modifying pyproject.toml, the lock file can be updated as below.
pdm lock
See PDM's documentation for further details.
- Create a Postgres user and DB for the backend (BE) and configure them as follows:
POSTGRES_USER: unstract_dev
POSTGRES_PASSWORD: unstract_pass
POSTGRES_DB: unstract_db
If you require a different config, make sure the necessary envs from backend/sample.env are exported.
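To illustrate how these values fit together, here is a minimal sketch that builds a Postgres connection URL from the environment, falling back to the defaults above. It assumes the three keys are exported as environment variables under the same names; `POSTGRES_HOST` and `POSTGRES_PORT` are hypothetical names introduced for this example, and the variable names the backend actually reads are the ones in backend/sample.env.

```python
import os
from typing import Mapping, Optional
from urllib.parse import quote


def postgres_url(env: Optional[Mapping[str, str]] = None) -> str:
    """Build a postgresql:// URL from env vars, using the defaults shown above."""
    env = os.environ if env is None else env
    user = env.get("POSTGRES_USER", "unstract_dev")
    password = env.get("POSTGRES_PASSWORD", "unstract_pass")
    db = env.get("POSTGRES_DB", "unstract_db")
    # Assumption: host and port variable names are illustrative only.
    host = env.get("POSTGRES_HOST", "localhost")
    port = env.get("POSTGRES_PORT", "5432")
    # Percent-encode credentials so characters like '@' or '/' stay unambiguous.
    return (
        f"postgresql://{quote(user, safe='')}:{quote(password, safe='')}"
        f"@{host}:{port}/{quote(db, safe='')}"
    )


if __name__ == "__main__":
    print(postgres_url())
```

Such a URL can be passed to `psql` to confirm that the user and database were created as expected.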
- We use pre-commit to run some hooks whenever code is pushed to perform linting and static code analysis, among other checks.
- Ensure dev dependencies are installed and you're in the virtual env
- Install hooks with pre-commit install or pdm run pre-commit install
- Manually trigger pre-commit hooks in the following ways:
#
# Using the tool directly
#
# Run all pre-commit hooks
pre-commit run
# Run specific pre-commit hook
pre-commit run flake8
# Run mypy pre-commit hook for selected folder
pre-commit run mypy --files prompt-service/**/*.py
# Run mypy for selected folder
mypy prompt-service/**/*.py
#
# Using pdm to run the scripts
#
# Run all pre-commit hooks
pdm run pre-commit run
# Run specific pre-commit hook
pdm run pre-commit run flake8
# Run mypy pre-commit hook for selected folder
pdm run pre-commit run mypy --files prompt-service/**/*.py
# Run mypy for selected folder
pdm run mypy prompt-service/**/*.py
- Check backend/ for running the backend.
- Install dependencies with
npm install
- Start the server with
npm start
It is possible to run a few services directly on the docker host while the others run as docker containers via docker compose.
This lets you develop one service without having to deploy the other services you are not working on.
The only change needed is to override the default Traefik proxy routing.
- Modify to update Traefik proxy routes for services running directly on docker host (host.docker.internal:<port>).
- Update host name of dependency components in config of services running directly on docker host:
- Replace as
IF container port is exposed on docker host
- OR use container IPs obtained via docker network inspect unstract-network
- OR run
IF container port is NOT exposed on docker host or if you want to keep dependency host names unchanged
Run the services.
When the same host name environment variables are used by both a service running locally and a service running in a container (for example, running in from a tool), host name resolution conflicts can arise for the following:
-> Using this inside a container points to the container itself, and not the
-> Meant to be used inside containers only, to get host IP. Does not make sense to use in services running locally.
In such cases, use another host name and point it to the host IP in /etc/hosts.
For example, the backend uses the PROMPT_HOST environment variable, which is also supplied
in the Tool configuration when spawning Tool containers. If the backend is running
locally and the Tools are in containers, we could set the value to
and add it to /etc/hosts as shown below.
<host_local_ip> prompt-service
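Before editing /etc/hosts, it can help to check whether the name is already mapped. The sketch below is an illustration only: `parse_hosts` and `hosts_line` are hypothetical helpers, and the IP address in the usage example is a made-up stand-in for `<host_local_ip>`.

```python
from typing import Dict


def parse_hosts(text: str) -> Dict[str, str]:
    """Map each host name found in hosts-file text to its IP address."""
    mapping: Dict[str, str] = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        ip, *names = line.split()
        for name in names:
            mapping.setdefault(name, ip)  # keep the first entry seen for a name
    return mapping


def hosts_line(ip: str, name: str) -> str:
    """Format a single hosts-file entry in the form shown above."""
    return f"{ip} {name}"


if __name__ == "__main__":
    with open("/etc/hosts", encoding="utf-8") as fh:
        known = parse_hosts(fh.read())
    if "prompt-service" not in known:
        # 192.168.1.10 is a placeholder; substitute your host's local IP.
        print("Add this line to /etc/hosts:", hosts_line("192.168.1.10", "prompt-service"))
```

The script only prints the suggested entry; writing to /etc/hosts is left to the user because it normally requires elevated privileges.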
An encryption key is used to securely encrypt and store data, for example credentials of connectors or adapters.
We make use of cryptography's Fernet to perform this encryption. Use this snippet to generate a key that can be set in your respective backend
and platform-service
ENCRYPTION_KEY=$(python -c "import secrets, base64; print(base64.urlsafe_b64encode(secrets.token_bytes(32)).decode())")
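The same key can be generated and sanity-checked from Python. This is a minimal sketch using only the standard library; `generate_encryption_key` and `is_valid_fernet_key` are hypothetical helper names. The check relies on the fact that a Fernet key is 32 random bytes in URL-safe base64 encoding, which yields a 44-character string.

```python
import base64
import secrets


def generate_encryption_key() -> str:
    """Generate a key in the same way as the shell one-liner above."""
    return base64.urlsafe_b64encode(secrets.token_bytes(32)).decode()


def is_valid_fernet_key(key: str) -> bool:
    """Return True if the key decodes from URL-safe base64 to exactly 32 bytes."""
    try:
        raw = base64.urlsafe_b64decode(key.encode())
    except (ValueError, TypeError):
        # Malformed base64 (for example, wrong padding) is not a usable key.
        return False
    return len(raw) == 32


if __name__ == "__main__":
    key = generate_encryption_key()
    print(f"ENCRYPTION_KEY={key}")
    print("valid:", is_valid_fernet_key(key))
```

Keep the generated key stable once data has been stored: values encrypted with one Fernet key cannot be decrypted with another.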