Legacy Knowledge Base
Published Sep. 10, 2025

FileImpl "Unable to extract text from" and other errors during document indexation related to TIKA library

Written By

Jorge Diaz

How To articles are not official guidelines or officially supported documentation. They are community-contributed content and may not always reflect the latest updates to Liferay DXP. We welcome your feedback to improve How To articles!

While we make every effort to ensure this Knowledge Base is accurate, it may not always reflect the most recent updates or official guidelines. We appreciate your understanding and encourage you to reach out with any feedback or concerns.

Legacy Article

You are viewing an article from our legacy "FastTrack" publication program, made available for informational purposes. Articles in this program were published without a requirement for independent editing or verification and are provided "as is" without guarantee.

Before using any information from this article, independently verify its suitability for your situation and project.

Issue

When documents (DLFileEntry) are reindexed, the log file shows several warnings and errors that contain the message "Unable to extract text from" and involve FileImpl and TikaException:

  • Tried to allocate an array of length 1133957, but 1000000 is the maximum for this record type.
2021-03-08T18:03:06.882+0100 WARN  [default-37][FileImpl:468] Unable to extract text from file_name
org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@5d15b90a
     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:293)
     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
     at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143)
     at org.apache.tika.Tika.parseToString(Tika.java:527)
     at com.liferay.portal.util.FileImpl._parseToString(FileImpl.java:1142)
     at com.liferay.portal.util.FileImpl.extractText(FileImpl.java:449)
     at com.liferay.portal.kernel.util.FileUtil.extractText(FileUtil.java:217)
[...]
     at com.liferay.portal.search.internal.SearchEngineInitializer.reindex(SearchEngineInitializer.java:190)
     at com.liferay.portal.search.internal.SearchEngineInitializer$1.call(SearchEngineInitializer.java:146)
     at com.liferay.portal.search.internal.SearchEngineInitializer$1.call(SearchEngineInitializer.java:139)
     at java.util.concurrent.FutureTask.run(FutureTask.java:266)
     at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
     at java.util.concurrent.FutureTask.run(FutureTask.java:266)
     at co.elastic.apm.agent.impl.async.SpanInScopeRunnableWrapper.run(SpanInScopeRunnableWrapper.java:64)
     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
     at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.poi.util.RecordFormatException: Tried to allocate an array of length 1133957, but 1000000 is the maximum for this record type._If the file is not corrupt, please open an issue on bugzilla to request _increasing the maximum allowable size for this record type._As a temporary workaround, consider setting a higher override value with IOUtils.setByteArrayMaxOverride()

or 

2021-03-30T21:43:11.878+0200 WARN  [default-6][FileImpl:478] Unable to extract text from file_name.docx
org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.ooxml.OOXMLParser@2c5260e5
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:293)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143)
    at com.liferay.portal.util.FileImpl._parseToString(FileImpl.java:1183)
    at com.liferay.portal.util.FileImpl.extractText(FileImpl.java:459)
    at com.liferay.portal.kernel.util.FileUtil.extractText(FileUtil.java:217)
[...]
    at com.liferay.portal.search.internal.SearchEngineInitializer.reindex(SearchEngineInitializer.java:190)
    at com.liferay.portal.search.internal.SearchEngineInitializer$1.call(SearchEngineInitializer.java:146)
    at com.liferay.portal.search.internal.SearchEngineInitializer$1.call(SearchEngineInitializer.java:139)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at co.elastic.apm.agent.impl.async.SpanInScopeRunnableWrapper.run(SpanInScopeRunnableWrapper.java:64)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.poi.util.RecordFormatException: Tried to allocate an array of length 2077184, but 2000000 is the maximum for this record type._If the file is not corrupt, please open an issue on bugzilla to request _increasing the maximum allowable size for this record type._As a temporary workaround, consider setting a higher override value with IOUtils.setByteArrayMaxOverride() [Sanitized]
    at org.apache.poi.util.IOUtils.throwRFE(IOUtils.java:630)
    at org.apache.poi.util.IOUtils.checkLength(IOUtils.java:205)
    at org.apache.poi.util.IOUtils.toByteArray(IOUtils.java:173)
    at org.apache.poi.util.IOUtils.toByteArray(IOUtils.java:149)
    at org.apache.poi.openxml4j.util.ZipArchiveFakeEntry.<init>(ZipArchiveFakeEntry.java:47)
    at org.apache.poi.openxml4j.util.ZipInputStreamZipEntrySource.<init>(ZipInputStreamZipEntrySource.java:53)
    at org.apache.poi.openxml4j.opc.ZipPackage.<init>(ZipPackage.java:106)
    at org.apache.poi.openxml4j.opc.OPCPackage.open(OPCPackage.java:307)
    at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:111)
    at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:113)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
    ... 41 more

In both cases, the log trace says that:

    • As a temporary workaround, consider setting a higher override value with IOUtils.setByteArrayMaxOverride()

so that limit, which belongs to the Apache POI library that Tika uses to parse Office files, would have to be raised.
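
To decide how high such an override would need to be, it can help to find the largest array length that POI tried to allocate. The following is a minimal sketch using only the Java standard library; the class name PoiLimitScanner and its method are hypothetical helpers, not part of Liferay, Tika or POI, and the pattern assumes the message wording shown in the traces above.

```java
import java.util.Arrays;
import java.util.List;
import java.util.OptionalLong;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only (not a Liferay, Tika or POI class).
final class PoiLimitScanner {

    // Matches the RecordFormatException wording shown in the traces above.
    private static final Pattern REQUESTED = Pattern.compile(
        "Tried to allocate an array of length (\\d+), but \\d+ is the maximum");

    // Returns the largest requested array length found in the given log lines,
    // or an empty OptionalLong when no line matches.
    static OptionalLong maxRequested(List<String> logLines) {
        long max = -1;
        for (String line : logLines) {
            Matcher matcher = REQUESTED.matcher(line);
            while (matcher.find()) {
                max = Math.max(max, Long.parseLong(matcher.group(1)));
            }
        }
        return max < 0 ? OptionalLong.empty() : OptionalLong.of(max);
    }

    public static void main(String[] args) {
        // Sample lines shortened from the two traces in this article.
        List<String> sample = Arrays.asList(
            "Tried to allocate an array of length 1133957, but 1000000 is the maximum for this record type.",
            "Tried to allocate an array of length 2077184, but 2000000 is the maximum for this record type.");
        System.out.println(maxRequested(sample));
    }
}
```

Any override would have to be larger than the value reported here. The MP4Parser messages use a different wording and are deliberately not matched by this pattern.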

 

  • Error parsing Matlab file with MatParser
2021-03-08T09:25:48.010+0100 WARN  [default-3][FileImpl:468] Unable to extract text from file_name.fig
org.apache.tika.exception.TikaException: Error parsing Matlab file with MatParser
    at org.apache.tika.parser.mat.MatParser.parse(MatParser.java:139)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143)
    at org.apache.tika.Tika.parseToString(Tika.java:527)
[...]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Caused by: com.jmatio.io.MatlabIOException: Incorrect matlab array class: function_handle
    at com.jmatio.io.MatFileReader.readMatrix(MatFileReader.java:930)

 

  • Errors caused by video formats that are not supported by Tika/MP4Parser:
2021-03-08T09:34:11.498+0100 ERROR [default-3][BasicContainer:122] Error during box parsing: java.lang.RuntimeException: box size of zero means 'till end of file. That is not yet supported
2021-03-08T09:34:11.499+0100 ERROR [default-3][BasicContainer:122] Error during box parsing: org.mp4parser.MemoryAllocationException: Tried to allocate 1937011819 bytes, but the limit for this record type is: 536870912. If you believe this file is not corrupt, please open a ticket on github to increase the maximum allowable size for this record type.
2021-03-08T09:34:11.499+0100 ERROR [default-3][BasicContainer:122] Error during box parsing: org.mp4parser.MemoryAllocationException: Tried to allocate 85899348400 bytes, but the limit for this record type is: 536870912. If you believe this file is not corrupt, please open a ticket on github to increase the maximum allowable size for this record type.
2021-03-08T09:34:11.499+0100 ERROR [default-3][BasicContainer:122] Error during box parsing: org.mp4parser.MemoryAllocationException: Tried to allocate 1668576363 bytes, but the limit for this record type is: 536870912. If you believe this file is not corrupt, please open a ticket on github to increase the maximum allowable size for this record type.

How can I disable the parsers that I don't want to use (for example, the Matlab MatParser or the MP4Parser)?

How can I avoid the warnings or error traces related to the TIKA library?

Environment

  • Liferay DXP 7.0+

Resolution

You can avoid some of these errors by changing the configuration of the Tika library, for example by disabling the parsers you do not need.
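
For example, to disable parsers you do not need, Tika's configuration format lets you exclude individual parsers from the default parser set with parser-exclude elements. The fragment below is only a sketch. It assumes that the tika.xml shipped with your DXP version declares org.apache.tika.parser.DefaultParser. The MatParser class name is taken from the stack trace above, while org.apache.tika.parser.mp4.MP4Parser is the class name used by Apache Tika 1.x and should be checked against the Tika version bundled with your installation. Merge the exclusions into the extracted file rather than replacing its whole content.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <parsers>
    <!-- Keep the default parser set, minus the parsers listed below -->
    <parser class="org.apache.tika.parser.DefaultParser">
      <!-- Assumed class names: verify them against your bundled Tika version -->
      <parser-exclude class="org.apache.tika.parser.mat.MatParser"/>
      <parser-exclude class="org.apache.tika.parser.mp4.MP4Parser"/>
    </parser>
  </parsers>
</properties>
```

With a parser excluded, Tika no longer attempts to extract text from the corresponding file types, so their content will not be searchable.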

The configuration of the Tika library is specified in a tika.xml file bundled with Liferay (inside portal-impl.jar in DXP 7.0-7.3, and inside com.liferay.portal.tika.jar in DXP 7.4). To override it with your own configuration:

  • DXP 7.0-7.3:
  1. Extract the tika.xml file from the jar file: [LIFERAY_HOME]/tomcat-[version]/webapps/ROOT/WEB-INF/lib/portal-impl.jar
    • You can open the jar file with any unzip tool (Linux: unzip; Windows: 7-Zip or WinZip)
    • Inside the jar file, the tika.xml file is located at tika/tika.xml
  2. Modify the extracted tika.xml file with the changes you need to apply. If you need more information see the links in the "Additional Information" section.
  3. Save the modified tika.xml file in a custom configuration folder on your system, for example inside [LIFERAY_HOME].
  4. Create a system-ext.properties file inside the [LIFERAY_HOME]/tomcat-[version]/webapps/ROOT/WEB-INF/classes folder so that it is on the DXP classpath. For more information on creating this file, check the System-ext.properties Reference Guide.
  5. Add the following line to system-ext.properties:
    • tika.config=/absolute_path_to_the_configuration_file/tika.xml
    • Replace absolute_path_to_the_configuration_file with the path of the folder that contains your tika.xml file
  6. Restart your Liferay server
  • DXP 7.4:
  1. Extract the tika.xml file from the jar file: [LIFERAY_HOME]/osgi/portal/com.liferay.portal.tika.jar
    • You can open the jar file with any unzip tool (Linux: unzip; Windows: 7-Zip or WinZip)
    • Inside the jar file, the tika.xml file is located in the com/liferay/portal/tika/internal/configuration/helper/dependencies directory.
  2. Modify the extracted tika.xml file with the changes you need to apply. If you need more information see the links in the "Additional Information" section.
  3. Save the modified tika.xml file inside the [LIFERAY_HOME]/tomcat-[version]/webapps/ROOT/WEB-INF/classes folder
  4. Go to System Settings => Infrastructure => Tika configuration and change the "Tika configuration xml" setting to tika.xml (the path of the tika.xml file relative to the classes folder)
  5. Restart your Liferay server
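
Step 1 of either procedure can also be scripted instead of using an unzip tool. The following is a minimal sketch using only the Java standard library; the class name TikaConfigExtractor and its command-line arguments are hypothetical and not part of Liferay.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

// Hypothetical helper for illustration only (not a Liferay class).
final class TikaConfigExtractor {

    // Copies a single entry (for example "tika/tika.xml") out of a jar file
    // into the target file, overwriting the target if it already exists.
    static void extract(Path jar, String entryName, Path target) throws IOException {
        try (ZipFile zip = new ZipFile(jar.toFile())) {
            ZipEntry entry = zip.getEntry(entryName);
            if (entry == null) {
                throw new IOException(entryName + " not found in " + jar);
            }
            try (InputStream in = zip.getInputStream(entry)) {
                Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // Arguments: <jar file> <entry path inside the jar> <output file>
        if (args.length != 3) {
            System.err.println("Usage: TikaConfigExtractor <jar> <entry> <output>");
            return;
        }
        extract(Paths.get(args[0]), args[1], Paths.get(args[2]));
    }
}
```

For DXP 7.0-7.3 the entry path is tika/tika.xml inside portal-impl.jar. For DXP 7.4 it is com/liferay/portal/tika/internal/configuration/helper/dependencies/tika.xml inside com.liferay.portal.tika.jar. The jar itself is only read, never modified.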

Additional Information
