Issue
When we reindex documents (DLFileEntry), the log file shows several warnings and errors containing the message "Unable to extract text from", related to FileImpl and TikaException:
- Tried to allocate an array of length 1133957, but 1000000 is the maximum for this record type.
2021-03-08T18:03:06.882+0100 WARN [default-37][FileImpl:468] Unable to extract text from file_name
org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@5d15b90a
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:293)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143)
at org.apache.tika.Tika.parseToString(Tika.java:527)
at com.liferay.portal.util.FileImpl._parseToString(FileImpl.java:1142)
at com.liferay.portal.util.FileImpl.extractText(FileImpl.java:449)
at com.liferay.portal.kernel.util.FileUtil.extractText(FileUtil.java:217)
[...]
at com.liferay.portal.search.internal.SearchEngineInitializer.reindex(SearchEngineInitializer.java:190)
at com.liferay.portal.search.internal.SearchEngineInitializer$1.call(SearchEngineInitializer.java:146)
at com.liferay.portal.search.internal.SearchEngineInitializer$1.call(SearchEngineInitializer.java:139)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at co.elastic.apm.agent.impl.async.SpanInScopeRunnableWrapper.run(SpanInScopeRunnableWrapper.java:64)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.poi.util.RecordFormatException: Tried to allocate an array of length 1133957, but 1000000 is the maximum for this record type._If the file is not corrupt, please open an issue on bugzilla to request _increasing the maximum allowable size for this record type._As a temporary workaround, consider setting a higher override value with IOUtils.setByteArrayMaxOverride()
or
2021-03-30T21:43:11.878+0200 WARN [default-6][FileImpl:478] Unable to extract text from file_name.docx
org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.ooxml.OOXMLParser@2c5260e5
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:293)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143)
at com.liferay.portal.util.FileImpl._parseToString(FileImpl.java:1183)
at com.liferay.portal.util.FileImpl.extractText(FileImpl.java:459)
at com.liferay.portal.kernel.util.FileUtil.extractText(FileUtil.java:217)
[...]
at com.liferay.portal.search.internal.SearchEngineInitializer.reindex(SearchEngineInitializer.java:190)
at com.liferay.portal.search.internal.SearchEngineInitializer$1.call(SearchEngineInitializer.java:146)
at com.liferay.portal.search.internal.SearchEngineInitializer$1.call(SearchEngineInitializer.java:139)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at co.elastic.apm.agent.impl.async.SpanInScopeRunnableWrapper.run(SpanInScopeRunnableWrapper.java:64)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.poi.util.RecordFormatException: Tried to allocate an array of length 2077184, but 2000000 is the maximum for this record type._If the file is not corrupt, please open an issue on bugzilla to request _increasing the maximum allowable size for this record type._As a temporary workaround, consider setting a higher override value with IOUtils.setByteArrayMaxOverride() [Sanitized]
at org.apache.poi.util.IOUtils.throwRFE(IOUtils.java:630)
at org.apache.poi.util.IOUtils.checkLength(IOUtils.java:205)
at org.apache.poi.util.IOUtils.toByteArray(IOUtils.java:173)
at org.apache.poi.util.IOUtils.toByteArray(IOUtils.java:149)
at org.apache.poi.openxml4j.util.ZipArchiveFakeEntry.(ZipArchiveFakeEntry.java:47)
at org.apache.poi.openxml4j.util.ZipInputStreamZipEntrySource.(ZipInputStreamZipEntrySource.java:53)
at org.apache.poi.openxml4j.opc.ZipPackage.(ZipPackage.java:106)
at org.apache.poi.openxml4j.opc.OPCPackage.open(OPCPackage.java:307)
at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:111)
at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:113)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
... 41 more
In both cases, the log trace suggests:
- As a temporary workaround, consider setting a higher override value with IOUtils.setByteArrayMaxOverride()
so that limit has to be raised through the configuration of the Tika library.
- Error parsing Matlab file with MatParser
2021-03-08T09:25:48.010+0100 WARN [default-3][FileImpl:468] Unable to extract text from file_name.fig
org.apache.tika.exception.TikaException: Error parsing Matlab file with MatParser
at org.apache.tika.parser.mat.MatParser.parse(MatParser.java:139)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143)
at org.apache.tika.Tika.parseToString(Tika.java:527)
[...]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: com.jmatio.io.MatlabIOException: Incorrect matlab array class: function_handle
at com.jmatio.io.MatFileReader.readMatrix(MatFileReader.java:930)
- Errors caused by video formats that are not supported by Tika/MP4Parser:
2021-03-08T09:34:11.498+0100 ERROR [default-3][BasicContainer:122] Error during box parsing: java.lang.RuntimeException: box size of zero means 'till end of file. That is not yet supported
2021-03-08T09:34:11.499+0100 ERROR [default-3][BasicContainer:122] Error during box parsing: org.mp4parser.MemoryAllocationException: Tried to allocate 1937011819 bytes, but the limit for this record type is: 536870912. If you believe this file is not corrupt, please open a ticket on github to increase the maximum allowable size for this record type.
2021-03-08T09:34:11.499+0100 ERROR [default-3][BasicContainer:122] Error during box parsing: org.mp4parser.MemoryAllocationException: Tried to allocate 85899348400 bytes, but the limit for this record type is: 536870912. If you believe this file is not corrupt, please open a ticket on github to increase the maximum allowable size for this record type.
2021-03-08T09:34:11.499+0100 ERROR [default-3][BasicContainer:122] Error during box parsing: org.mp4parser.MemoryAllocationException: Tried to allocate 1668576363 bytes, but the limit for this record type is: 536870912. If you believe this file is not corrupt, please open a ticket on github to increase the maximum allowable size for this record type.
- These traces are related to the following TIKA and mp4parser issues:
How can I disable the parsers that I don't want to use (for example, the Matlab MatParser or the MP4Parser)?
How can I avoid the warnings or error traces related to the TIKA library?
Environment
- Liferay DXP 7.0+
Resolution
You can avoid some of these errors by changing the configuration of the Tika library. For example, you can:
- Change the specific configuration of a parser:
  - For example, you can change the byteArrayMaxOverride parameter of the OfficeParser parser to avoid the "Tried to allocate an array of length..." warning
  - See external links: https://stackoverflow.com/questions/64221010/apache-tika-tried-to-allocate-an-array-of-length-1835606-but-1000000-is-the-ma or https://www.mail-archive.com/user@tika.apache.org/msg03054.html
- Remove a problematic parser:
  - For example, you can remove the MatParser or the MP4Parser parsers to avoid the related errors
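Both kinds of change can live in the same tika.xml. The fragment below is only a sketch of what such a file might look like, not the file shipped with DXP: the bundled tika.xml may be structured differently, so adapt the entries to what you find there. The value 3000000 is illustrative (it only needs to exceed the array length reported in your log), the MatParser class name is taken from the stack trace above, and the MP4Parser class name org.apache.tika.parser.mp4.MP4Parser is an assumption to verify against the Tika version bundled with your DXP release. Support for the byteArrayMaxOverride parameter also depends on that Tika version.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <parsers>
    <!-- Keep Tika's default parser set, minus the parsers excluded below -->
    <parser class="org.apache.tika.parser.DefaultParser">
      <!-- Excluded here only so it can be re-declared below with a custom parameter -->
      <parser-exclude class="org.apache.tika.parser.microsoft.OfficeParser"/>
      <!-- Removed entirely; class name as it appears in the MatParser stack trace -->
      <parser-exclude class="org.apache.tika.parser.mat.MatParser"/>
      <!-- Removed entirely; class name is an assumption, check it in your Tika version -->
      <parser-exclude class="org.apache.tika.parser.mp4.MP4Parser"/>
    </parser>
    <!-- Re-declare OfficeParser with a higher array limit (illustrative value) -->
    <parser class="org.apache.tika.parser.microsoft.OfficeParser">
      <params>
        <param name="byteArrayMaxOverride" type="int">3000000</param>
      </params>
    </parser>
  </parsers>
</properties>
```

Keep in mind that files handled by an excluded parser will no longer have their text or metadata extracted by it, so only exclude parsers for formats you do not need to search.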
The configuration of the Tika library is specified in the tika.xml file inside portal-impl.jar. To override this configuration with your own, follow these steps:
- DXP 7.0-7.3:
  - Extract the tika.xml file from the jar file: [LIFERAY_HOME]/tomcat-[version]/webapps/ROOT/WEB-INF/lib/portal-impl.jar
    - You can open the jar file with any unzip tool (Linux: unzip; Windows: 7-Zip or WinZip)
    - Inside the jar file, you will find the tika.xml file in the tika directory, at tika/tika.xml
  - Modify the extracted tika.xml file with the changes you need to apply. If you need more information, see the links in the "Additional Information" section.
  - Save the modified tika.xml file inside a custom configuration folder on your system, for example inside [LIFERAY_HOME]
  - Create a system-ext.properties file inside the [LIFERAY_HOME]/tomcat-[version]/webapps/ROOT/WEB-INF/classes folder to add it to the DXP classpath. If you need more information on creating the system-ext.properties file, check the System-ext.properties Reference Guide.
  - Add the following line to system-ext.properties:
    - tika.config=/absolute_path_to_the_configuration_file/tika.xml
    - Replace absolute_path_to_the_configuration_file with the path of the folder that contains your tika.xml file
  - Restart your Liferay server
- DXP 7.4:
  - Extract the tika.xml file from the jar file: [LIFERAY_HOME]/osgi/portal/com.liferay.portal.tika.jar
    - You can open the jar file with any unzip tool (Linux: unzip; Windows: 7-Zip or WinZip)
    - Inside the jar file, you will find the tika.xml file in the com/liferay/portal/tika/internal/configuration/helper/dependencies directory
  - Modify the extracted tika.xml file with the changes you need to apply. If you need more information, see the links in the "Additional Information" section.
  - Save the modified tika.xml file inside the [LIFERAY_HOME]/tomcat-[version]/webapps/ROOT/WEB-INF/classes folder
  - Go to System Settings => Infrastructure => Tika configuration and change the "Tika configuration xml" setting to tika.xml (the relative path to the tika.xml file inside the classes folder)
  - Restart your Liferay server
Additional Information
- How to create and configure the system-ext.properties file
- How to configure the tika parsers in the tika.xml configuration file (external link)
- External information on how to change the byteArrayMaxOverride parameter of the OfficeParser parser: