PINE (Pmap Interface for Nlp Experimentation)
██████╗ ██╗███╗   ██╗███████╗
██╔══██╗██║████╗  ██║██╔════╝
██████╔╝██║██╔██╗ ██║█████╗
██╔═══╝ ██║██║╚██╗██║██╔══╝
██║     ██║██║ ╚████║███████╗
╚═╝     ╚═╝╚═╝  ╚═══╝╚══════╝
Pmap Interface for Nlp Experimentation
© 2019 The Johns Hopkins University Applied Physics Laboratory LLC.
About PINE
PINE is a web-based tool for text annotation. It enables annotation at the document level as well as over text spans (words). The annotation facilitates generation of natural language processing (NLP) models to classify documents and perform named entity recognition. Some of the features include:
- Generate models in spaCy, OpenNLP, or CoreNLP on the fly and rank documents using active learning to reduce annotation time.
- Extensible framework: add NLP pipelines of your choice.
- Active learning support: out-of-the-box active learning (https://en.wikipedia.org/wiki/Active_learning_(machine_learning)) with pluggable ranking functions.
- Facilitates group annotation projects: view other annotators' work, calculate inter-annotator agreement, and display annotation performance.
- Enterprise authentication: integrate with your existing OAuth/Active Directory servers.
- Scalability: deploy with Docker Compose or in a Kubernetes cluster; optionally use a database-as-a-service such as Cosmos DB.
PINE was developed under internal research and development (IRAD) funding at the Johns Hopkins University Applied Physics Laboratory. It was created to support the annotation needs of NLP tasks on the precision medicine analytics platform (PMAP) at Johns Hopkins.
Required Resources
Note: download the required resources and place them in pipelines/pine/pipelines/resources:
- apache-opennlp-1.9.0
- stanford-corenlp-full-2018-02-27
- stanford-ner-2018-02-27
Alternatively, you can use the provided convenience script:
./pipelines/download_resources.sh
These are required to build docker images for active learning.
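After downloading, the resources directory should contain one subdirectory per toolkit; a quick check (names taken from the list above):

ls pipelines/pine/pipelines/resources
# expected: apache-opennlp-1.9.0  stanford-corenlp-full-2018-02-27  stanford-ner-2018-02-27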
Development Environment
First, refer to the various README files in the subproject directories for dependencies. Alternatively, a convenience script is provided:
./setup_dev_stack.sh
Then a dev stack can be run with:
./run_dev_stack.py
You will probably also need to update .env and set VEGAS_CLIENT_SECRET if you plan to use that auth module.
The dev stack can be stopped with Ctrl-C.
Sometimes mongod does not start in time. If you see a mongod connection error, stop the dev stack and start it again.
Once the dev stack is up and running, the following ports are accessible:
- localhost:4200 is the main entrypoint and hosts the web app
- localhost:5000 hosts the backend
- localhost:5001 hosts the eve layer
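A quick smoke test once everything is up (assuming the dev stack serves plain HTTP on these ports; no PINE-specific endpoints are assumed):

curl -sI http://localhost:4200 | head -n 1   # web app
curl -sI http://localhost:5000 | head -n 1   # backend
curl -sI http://localhost:5001 | head -n 1   # eve layer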
Generating documentation
- See docs/README.md for information on the required environment.
- Run ./generate_documentation.sh.
- Generated documentation can then be found in ./docs/build.
Testing Data
To import testing data, run the dev stack and then run:
./setup_dev_test_data.sh
WARNING: This script will remove any pre-existing data. If you need to clear your database
for other reasons, stop your dev stack and then rm -rf local_data/dev/eve/db.
Testing
There are test cases written using Cypress; for more information, see test/README.md.
The short version, to run the tests using the docker-compose stack:
- test/build.sh
- test/run_docker_compose.sh --report
- Check ./results/<timestamp> (the script in the previous step will print out the exact path) for:
  - reports/report.html: an HTML report of the tests run and their status
  - screenshots/: screenshots from any failed tests
  - videos/: videos of all the tests that were run
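As a convenience, a sketch for opening the most recent HTML report (assumes the ./results/<timestamp> layout described above and a desktop with xdg-open available):

latest=$(ls -td ./results/*/ | head -n 1)
xdg-open "${latest}reports/report.html"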
To use the interactive dashboard:
- test/build.sh
- test/run_docker_compose.sh --dashboard
It is also possible to run the cypress container directly, or locally with the dev stack. For more
information, see test/README.md.
Versioning
There are three versions being tracked:
- overall version: environment variable PINE_VERSION, based on the git tag/revision information (see ./version.sh)
- eve/database version: controlled in eve/python/settings.py
- frontend version: controlled in frontend/annotation/package.json
The eve/database version should be bumped up when the schema changes. This will (eventually) be used to implement data migration.
The frontend version is the least important.
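For reference, the overall version can be captured in the shell the same way the docker-compose instructions below do:

PINE_VERSION=$(./version.sh)
echo "$PINE_VERSION"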
Using the copyright-checking pre-commit hook
The script pre-commit is provided as a helpful utility to make sure that new files checked into
the repository contain the copyright text. It is not automatically installed and must be
installed manually:
ln -s ../../pre-commit .git/hooks/
This hook greps for the copyright text in new files and gives you the option to abort if it is not found.
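Conceptually the hook does something like the following (an illustrative sketch, not the actual script; the real logic lives in ./pre-commit and the exact copyright string may differ):

#!/bin/sh
# Sketch: flag newly added files that lack the copyright text.
for f in $(git diff --cached --name-only --diff-filter=A); do
    if ! grep -q "Johns Hopkins University Applied Physics Laboratory" "$f"; then
        echo "Missing copyright text: $f"
        # the real hook offers the option to abort the commit here
    fi
done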
Docker Environments
IMPORTANT:
For all the docker-compose environments, it is required to set a PINE_VERSION environment
variable. To do this, either prepend each docker-compose command:
PINE_VERSION=$(./version.sh) docker-compose ...
Or export it in your shell:
export PINE_VERSION=$(./version.sh)
docker-compose ...
The docker environment is run using docker-compose. There are two supported configurations: the default and the prod configuration.
If desired, edit .env to change default variable values. You probably also need to update
.env for VEGAS_CLIENT_SECRET, if you are planning to use that auth module.
To build the images for DEFAULT configuration:
docker-compose build
Or use the convenience script:
./run_docker_compose.sh --build
To run containers as daemons for the DEFAULT configuration (remove the -d flag to see logs):
docker-compose up -d
When running without -d, you may also want the --abort-on-container-exit flag, which makes errors more apparent.
Or use the convenience script:
./run_docker_compose.sh --up
With default settings, the webapp will now be accessible at https://localhost:8888
Production Docker Environment
To use the production docker environment instead of the default, simply add
-f docker-compose.yml -f docker-compose.prod.yml after the docker-compose command, e.g.:
docker-compose -f docker-compose.yml -f docker-compose.prod.yml build
Note that you probably need to update .env and add the MONGO_URI property.
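An illustrative .env addition (host, credentials, and database name are placeholders; the value follows the standard MongoDB connection-string format):

MONGO_URI=mongodb://pine_user:changeme@mongo.example.com:27017/pine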
Test data
To import test data, you need to run the docker-compose stack using the docker-compose.test.yml file:
docker-compose build
docker-compose -f docker-compose.yml -f docker-compose.override.yml -f docker-compose.test.yml up
Or use the convenience script:
./run_docker_compose.sh --build
./run_docker_compose.sh --up-test
Once the system is up and running:
./setup_docker_test_data.sh
Once the test data has been imported, you no longer need to use the docker-compose.test.yml file.
If you need to clear the database, bring down the container and remove the nlp_webapp_eve_db and
nlp_webapp_eve_logs volumes with docker volume rm.
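Concretely, assembling the steps above:

docker-compose down
docker volume rm nlp_webapp_eve_db nlp_webapp_eve_logs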
If you are migrating from a very old PINE version and are seeing application errors, you may need to migrate your data:
docker-compose exec eve python3 python/update_documents_annnotation_status.py
User management using "eve" auth module
Note: these scripts only apply to the "eve" auth module, which stores users in the eve database. Users in the "vegas" module are managed externally.
Once the system is up and running:
docker-compose exec backend scripts/data/list_users.sh
This script will reset all user passwords to their email:
docker-compose exec backend scripts/data/reset_user_passwords.sh
This script will add a new administrator:
docker-compose exec backend scripts/data/add_admin.sh <email username> <password>
This script will set a single user's password:
docker-compose exec backend scripts/data/set_user_password.sh <email username> <password>
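For example, to create an administrator and then change that user's password (the email and passwords are placeholders):

docker-compose exec backend scripts/data/add_admin.sh admin@example.com initial-password
docker-compose exec backend scripts/data/set_user_password.sh admin@example.com new-password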
Alternatively, there is an Admin Dashboard through the web interface.
Misc Configuration
Configuring Logging
See logging configuration files in ./shared/. logging.python.dev.json is used with the
dev stack; the other files are used in the docker containers.
The docker-compose stack is currently set to bind the ./shared/ directory into the containers
at run-time. This allows for configuration changes of the logging without needing to rebuild
containers, and also allows the python logging config to live in one place instead of spread out
into each container. This is controlled with the ${SHARED_VOLUME} variable from .env.
Log files will be stored in the ${LOGS_VOLUME} location from .env. Pipeline model files will
be stored in the ${MODELS_VOLUME} location from .env.
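An illustrative set of .env values for these variables (the paths are placeholders; check .env for the actual defaults):

SHARED_VOLUME=./shared
LOGS_VOLUME=./local_data/logs
MODELS_VOLUME=./local_data/models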
Collection/Document Images
It is now possible to explore images in the "annotate document" page in the frontend UI. The image
URL is specified in the metadata field with the key imageUrl. If the URL starts with a "/" it
is loaded from a special endpoint in the backend that loads from a locally attached volume. For
docker, this volume is controlled by the DOCUMENT_IMAGE_VOLUME variable in .env. For running
the dev stack, this volume can be found in ./local_data/dev/test_images.
To upload images outside the UI, the following procedures should be used:
- All images in the collection should be in the directory <image volume>/by_collection/<collection ID>.
- Subdirectories (such as for individual documents) are allowed but not mandatory.
- The document metadata imageUrl should be set to /<image path within the collection directory>.
- For example, an imageUrl of /image.jpg would load <image volume>/by_collection/<collection ID>/image.jpg through the backend.
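As a concrete sketch (the collection ID abc123 and the path scans/page1.jpg are hypothetical): a document whose metadata sets imageUrl to /scans/page1.jpg would be resolved by the backend to a file like the following, using the DOCUMENT_IMAGE_VOLUME location from .env:

ls "${DOCUMENT_IMAGE_VOLUME}/by_collection/abc123/scans/page1.jpg"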