Enabling Solr OCR

Beginning with FileCloud version 20.3, Solr OCR is available to Enterprise users and to users with OCR licenses.

When you enable OCR:

  • FileCloud's content search engine searches image files and PDF files for your search string. 
  • FileCloud's content classification engine (CCE) scans image files and PDFs for pattern-matching text.

Install and enable Solr OCR on Windows

Follow these instructions on Windows when performing a fresh installation of FileCloud or when performing an upgrade to the OCR component license.

  1. Upgrade to FileCloud 20.3 or higher.
  2. Open cloudconfig.php at XAMPP DIRECTORY/htdocs/config/cloudconfig.php
  3. Add the following:

    define("TESSERACTOCR_BIN_DIR", "C:\\xampp\\tesseractocr");
    define("TESSERACTOCR_TESSDATA_DIR", "C:\\xampp\\tesseractocr\\tessdata");

    Note:
    TESSERACTOCR_BIN_DIR is the path to the TesseractOCR installation directory, which contains the tesseract binary. On Windows, this is typically C:\xampp\tesseractocr\
    TESSERACTOCR_TESSDATA_DIR is the path to the TesseractOCR training data. On Windows, this is typically C:\xampp\tesseractocr\tessdata

  4. In the Admin portal, click Settings in the navigation pane, and then click the Content Search tab.

  5. If you are performing an upgrade, click Reset.

     If you are performing a fresh installation, click Configure.
  6. Beside Enable Solr OCR, click the Enable button. 

    A confirmation box warns you that enabling OCR will require you to restart Solr.

  7. Click OK.
    A dialog box confirms that OCR is enabled and prompts you to restart Solr.

  8. Restart the Solr (Content Search) service from the FileCloud control panel.

  9. In the Admin portal, go to Settings, and click the Content Search tab again.
  10. Confirm that:
    • The Enable button is disabled
    • The message below the button says OCR has been successfully setup.
  11. To build or rebuild the search index with OCR for images with text and PDFs, under Managed Storage Index Status,
    • If you are performing a fresh installation, click Index.
    • If you are performing an upgrade, click Reindex.

Install and enable Solr OCR on Ubuntu Linux

Follow these instructions on Linux when performing a fresh installation of FileCloud or when performing an upgrade to the OCR component license.

  1. Upgrade to FileCloud 20.3 or higher.
  2. Run filecloudcp -t
  3. In the Admin portal, click Settings in the navigation pane, and then click the Content Search tab.

  4. If you are performing an upgrade, click Reset and delete the current fccore if it exists (run the command rm -rf /opt/solrfcdata/var/solr/data/fccore/).
  5. Inspect the file solrconfig.xml inside /var/www/html/thirdparty/overrides/solarium/Solarium/fcskel/conf and uncomment the line containing parseContext.xml.
  6. In /var/www/html/thirdparty/overrides/solarium/Solarium, copy the folder fcskel into /opt/solrfcdata/var/solr/data (on the solr server) and rename it fccore.
    Note: For a multi-tenant setup, rename it fccore_<sitename> (for example, if the site name is mysite, rename it fccore_mysite).
  7. In the Admin portal, go to Settings, and click the Content Search tab again.
  8. Click Configure.

  9. Confirm that the Enable button is disabled and that the message below the button says OCR has been successfully setup.
  10. To build or rebuild the search index with OCR for images and PDFs with text, click Index.
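
Steps 4 and 6 above can be sketched as a small shell function. This is only an illustration under stated assumptions, not part of FileCloud's tooling: the function name install_fccore and its arguments are hypothetical, and the paths in the usage comment are the ones given in the steps. It does not perform step 5, so the parseContext.xml line in fcskel/conf/solrconfig.xml still has to be uncommented by hand before the copy.

```shell
#!/bin/sh
# install_fccore SKEL_DIR DATA_DIR [SITE_NAME]
# Hypothetical helper (not a FileCloud command): removes any existing core
# directory and copies the fcskel skeleton into DATA_DIR as fccore, or as
# fccore_<SITE_NAME> when a site name is given (multi-tenant setup).
install_fccore() {
    skel_dir=$1
    data_dir=$2
    site_name=${3:-}

    if [ ! -d "$skel_dir" ]; then
        echo "skeleton directory not found: $skel_dir" >&2
        return 1
    fi
    if [ ! -d "$data_dir" ]; then
        echo "Solr data directory not found: $data_dir" >&2
        return 1
    fi

    core_name=fccore
    if [ -n "$site_name" ]; then
        core_name="fccore_$site_name"
    fi

    # Step 4: delete the current core if it exists.
    rm -rf "${data_dir:?}/$core_name"
    # Step 6: copy the skeleton and give the copy the core name.
    cp -R "$skel_dir" "$data_dir/$core_name"
}

# Example calls with the paths from the steps above (run on the Solr server):
# install_fccore /var/www/html/thirdparty/overrides/solarium/Solarium/fcskel \
#                /opt/solrfcdata/var/solr/data
# install_fccore /var/www/html/thirdparty/overrides/solarium/Solarium/fcskel \
#                /opt/solrfcdata/var/solr/data mysite
```

Because the function deletes the target core directory before copying, it should only be run when a reset of that core is intended.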

Install and enable Solr OCR on other Linux distributions

  1. To confirm that Tesseract is set up, enter:

    filecloudcp -t

    You should receive the response Tesseract is already installed and configured.

  2. To assign the Apache user (usually named www-data) to the solr group (for example, solr:x:123), open /etc/group for editing and append the Apache user name to the solr group entry:

    solr:x:123:www-data
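  3. Restart Apache. The service is named apache2 on Debian-based distributions and httpd on Red Hat-based distributions.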

    systemctl restart apache2
  4. Assign read and write permissions to the solr group for the Solr core directory of the site/tenant that OCR is being set up for.

    chmod -R g+rw /opt/solrfcdata/var/solr/data/fccore_<sitename>
  5. In the FileCloud admin portal, go to Settings > Content Search, and click Enable next to Enable Solr OCR.
  6. Restart Solr.

    systemctl restart solr
  7. Reload the FileCloud Content Search screen.
    The note below the Enable button should say Image and PDF OCR is enabled.
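
Before clicking Enable in step 5, the results of steps 2 and 4 can be checked from a shell. The following is a minimal sketch, assuming a POSIX shell with the standard id and find utilities; the function names user_in_group and group_can_read_write are hypothetical, and www-data, solr, and the core path are the example values from the steps above.

```shell
#!/bin/sh
# user_in_group USER GROUP
# Succeeds if USER is a member of GROUP according to `id -nG`.
user_in_group() {
    id -nG "$1" 2>/dev/null | tr ' ' '\n' | grep -qx -- "$2"
}

# group_can_read_write DIR
# Succeeds if DIR itself has both the group read and group write bits set.
# Only the top-level directory is inspected, not its contents.
group_can_read_write() {
    [ -n "$(find "$1" -prune -perm -060 2>/dev/null)" ]
}

# Example checks with the values used in the steps above:
# user_in_group www-data solr || echo "www-data is not in the solr group"
# group_can_read_write /opt/solrfcdata/var/solr/data/fccore_mysite \
#     || echo "solr group lacks read/write access to the core directory"
```

On distributions that ship usermod, running usermod -aG solr www-data as root is a common alternative to editing /etc/group by hand in step 2.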

Enable OCR manually

If your system is unable to configure OCR automatically, use the following instructions to enable it manually when performing a fresh installation of FileCloud or when performing an upgrade to the OCR component license.

  1. Upgrade to FileCloud 20.3 or higher.
  2. Set the Tesseract environment variables:
    • For Windows, add the following to solr.in.cmd:

      SET PATH=%PATH%;C:\xampp\tesseractocr
      SET TESSDATA_PREFIX=C:\xampp\tesseractocr\tessdata
    • For Linux/Unix, add the following to solr.in.sh (or define the environment variables globally):

      PATH="/path/to/tesseractocr:$PATH"
      TESSDATA_PREFIX=/path/to/tesseractocr/tessdata
  3. In the Admin portal, click Settings in the navigation pane, and then click the Content Search tab.
  4. If you are performing an upgrade, click Reset.
    If you are performing a fresh installation, clicking Reset is not necessary.
  5. In C:\xampp\htdocs\thirdparty\overrides\solarium\Solarium, copy the folder fcskel and rename the copy fccore.
    Then move it into C:\xampp\solr\server\solr.
  6. Restart the Solr (Content Search) service from the FileCloud control panel.
  7. In the Admin portal, go to Settings, and click the Content Search tab again.
  8. Confirm that the label beneath the Enable Solr OCR button says OCR has been successfully setup.
  9. To build or rebuild the search index with OCR for images with text and PDFs:
    • If you are performing a fresh installation, click Index.
    • If you are performing an upgrade, click Reindex.
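
For the Linux/Unix case in step 2, the two settings can be appended to solr.in.sh with a small script instead of a text editor. This is a sketch only: the function name add_tesseract_env is hypothetical, the location of solr.in.sh depends on the installation, and /path/to/tesseractocr is the same placeholder used in step 2. The function writes the lines exactly as step 2 shows them and does nothing if the file already sets TESSDATA_PREFIX.

```shell
#!/bin/sh
# add_tesseract_env FILE TESS_DIR
# Hypothetical helper: appends the two Tesseract settings from step 2 to
# FILE (for example solr.in.sh) unless FILE already sets TESSDATA_PREFIX.
add_tesseract_env() {
    env_file=$1
    tess_dir=$2

    if [ ! -f "$env_file" ]; then
        echo "file not found: $env_file" >&2
        return 1
    fi

    # Skip if an earlier run already added the settings.
    if grep -q '^TESSDATA_PREFIX=' "$env_file"; then
        return 0
    fi

    {
        # $PATH is written literally so that it expands when the file is sourced.
        printf 'PATH="%s:$PATH"\n' "$tess_dir"
        printf 'TESSDATA_PREFIX=%s/tessdata\n' "$tess_dir"
    } >> "$env_file"
}

# Example call (both paths are placeholders for the actual locations):
# add_tesseract_env /path/to/solr.in.sh /path/to/tesseractocr
```

Running the function a second time leaves the file unchanged, so it is safe to include in a repeatable setup script.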