Issue
- We upgraded from DXP 7.3 Update 30 to DXP 2023.Q4.4. After the upgrade, the reindexing process appears to be stuck at 99%. It seems to have failed, and the log contains several errors like this:
ERROR [default-50][jericho:211] StartTag table at (r1,c2162,p2161) contains attribute name with invalid character at position (r1,c2266,p2265)
- We have a large amount of web content (several thousand web content articles are being upgraded), and we found that the indexer failed because of invalid HTML in the web content. We have reviewed this article: https://help.liferay.com/hc/en-us/articles/22875397024525-HTML-Parsing-errors-when-reindexing-journal-articles-jericho. However, its solution is not feasible for us because we need to keep all the old data.
Environment
- Liferay DXP 7.4 Quarterly Release 2023.Q4
Resolution
- With this much data, it is expected that the indexing process appears stuck for a while. It most likely just needs more time to complete.
- However, to verify whether the reindex actually fails, raise the log level of com.liferay.portal.search.elasticsearch7.internal.ElasticsearchIndexSearcher to INFO. Open the Main Menu on the top right, go to Control Panel > System > Server Administration > Log Levels, click the "+" on the upper right, and add the category com.liferay.portal.search.elasticsearch7.internal.ElasticsearchIndexSearcher with the log level set to INFO.
- After saving this log level, start a full reindex and check the log to see whether it fails. In most cases, the indexing process will finish and the log will show this message:
INFO [liferay/background_task-7][ReindexPortalBackgroundTaskExecutor:81] Finished reindexing company xxxxx with execution mode full
- If the indexing does get stuck, index only the latest versions of the web content by following these steps:
- Uncheck the "Index All Article Versions Enabled" checkbox under Control Panel > System Settings > Content and Data > Web Content > Virtual Instance Scope > Web Content > Index All Article Versions Enabled, and save the configuration.
- Run the reindexCompanyWebContents.groovy script from the article linked above. Make sure to change the "companyId" value before running it.
- Check the logs to gather all the problematic web content. The script prints every web content article as it is indexed, so collect the ones where the error appears immediately after the
Indexing: articleId=
... message.
- Review the HTML in these web content articles and correct the issues.
- Reindex the web contents from Control Panel > Search > Web Content > Web Content Article (com.liferay.journal.model.JournalArticle).
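
With several thousand web content articles, going through the log by hand can be tedious. The following Python snippet is a minimal sketch of how the two log checks above could be automated. It is not part of Liferay or of the Groovy script. The function name scan_log is hypothetical, and the sketch assumes that the script prints the article ID directly after "Indexing: articleId=" (up to the next comma or whitespace) and that the jericho error appears on the very next log line. Adjust the patterns to match the actual output.

```python
import re

# Assumption: the script prints "Indexing: articleId=<id>", with the ID running
# up to the next comma or whitespace. Adjust this pattern to the real output.
INDEXING = re.compile(r"Indexing: articleId=([^,\s]+)")
# Matches parser errors such as "ERROR [default-50][jericho:211] StartTag ..."
JERICHO_ERROR = re.compile(r"ERROR \[[^\]]*\]\[jericho:\d+\]")
# Matches the success message logged at the end of a full reindex.
FINISHED = re.compile(r"Finished reindexing company \S+ with execution mode full")


def scan_log(lines):
    """Return (failing_article_ids, finished) for an iterable of log lines."""
    failing = []
    finished = False
    current = None  # articleId of the most recent "Indexing:" line, if still relevant
    for line in lines:
        match = INDEXING.search(line)
        if match:
            current = match.group(1)
        elif JERICHO_ERROR.search(line):
            # Only count errors that directly follow an "Indexing:" line.
            if current is not None and current not in failing:
                failing.append(current)
        else:
            # Any other line ends the "immediately after" window.
            current = None
            if FINISHED.search(line):
                finished = True
    return failing, finished
```

To use it, pass an open log file, for example scan_log(open(path, encoding="utf-8", errors="replace")), where path points to your Liferay log. The returned list contains the article IDs to review, and the boolean indicates whether the "Finished reindexing" message was found.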