Prevent GATE from terminating when it reads a letter that is empty or contains illegal characters


Lacey A.S.
 

Hi - some of the .docx files we read in are parsed through Tika in GATE, and some of them contain illegal characters or are empty. When this happens GATE terminates midway through a corpus, and we were wondering if there is any way to prevent this behavior. We are processing tens of thousands of letters using BatchProcessApp.java. Are there any tips on how to handle the exception other than terminating the program? There may be other reasons for the termination, but we believe it is always due to illegal characters in the doc files. Here is a sample error:


--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
An error occurred processing document 'some_doc.doc'. This was document 3 of 3 in the 'ds f' corpus. See the log for details
java.lang.IllegalArgumentException: fromKey > toKey
at gate.util.RBTreeMap$SubMap.<init>(RBTreeMap.java:833)
at gate.util.RBTreeMap.subMap(RBTreeMap.java:751)
at gate.annotation.AnnotationSetImpl.get(AnnotationSetImpl.java:640)
at gate.annotation.AnnotationSetImpl.get(AnnotationSetImpl.java:559)
at gate.context.ContextFeaturesTagger.assignContextFeatures(ContextFeaturesTagger.java:219)
at gate.context.ContextFeaturesTagger.execute(ContextFeaturesTagger.java:146)
at gate.util.Benchmark.executeWithBenchmarking(Benchmark.java:293)
at gate.creole.ConditionalSerialController.runComponent(ConditionalSerialController.java:172)
at gate.creole.SerialController.executeImpl(SerialController.java:158)
at gate.creole.ConditionalSerialAnalyserController.executeImpl(ConditionalSerialAnalyserController.java:225)
at gate.creole.ConditionalSerialAnalyserController.execute(ConditionalSerialAnalyserController.java:132)
at gate.util.Benchmark.executeWithBenchmarking(Benchmark.java:293)
at gate.gui.SerialControllerEditor$RunAction$1.run(SerialControllerEditor.java:1778)
at java.base/java.lang.Thread.run(Thread.java:832)
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------


Ian Roberts
 

The fix that requires the least change to your current workflow would be to convert your pipeline into (or wrap it in) a "realtime" controller, which has options to suppress exceptions that occur during processing. A more principled answer, though, is to ask whether you'd consider switching from BatchProcessApp to GCP.

BatchProcessApp was written more as an example of how you can use GATE Embedded from Java code than as a serious tool for large-scale processing. GCP is the tool that is designed for batch processing: it can handle more kinds of inputs and outputs, and it handles failures like this more gracefully, simply marking the document as "failed" in the final report rather than crashing the whole Java process. In the event that a GCP batch does crash, you can re-run it with the same parameters and it will read the report file from the first run and continue from where it left off.
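
To make the "mark as failed and carry on" and "continue from where it left off" ideas concrete, here is a minimal plain-Java sketch of the pattern. It is not GCP's implementation and it does not call the GATE API; the names FaultTolerantBatch, DocumentTask and run are hypothetical, and DocumentTask stands in for whatever per-document work a BatchProcessApp-style loop does (load the file, run the application over it, write the output).

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch: isolate per-document failures instead of aborting the batch. */
final class FaultTolerantBatch {

    /** Stand-in (assumption) for "load one file and run the pipeline over it". */
    interface DocumentTask {
        void process(String docName) throws Exception;
    }

    /**
     * Runs the task over every document and records one outcome per document.
     * Documents listed in alreadyDone (for example, read back from the report
     * of an earlier, interrupted run) are skipped.
     */
    static Map<String, String> run(List<String> docs,
                                   Set<String> alreadyDone,
                                   DocumentTask task) {
        Map<String, String> report = new LinkedHashMap<>();
        for (String doc : docs) {
            if (alreadyDone.contains(doc)) {
                continue; // handled in a previous run
            }
            try {
                // In a real batch loop, the per-document GATE work goes here.
                task.process(doc);
                report.put(doc, "ok");
            } catch (Exception e) {
                // Record the failure and move on to the next document.
                // Note: this does not catch Errors such as OutOfMemoryError.
                report.put(doc, "failed: " + e);
            }
        }
        return report;
    }

    public static void main(String[] args) {
        List<String> docs = List.of("a.docx", "empty.docx", "b.docx");
        // Simulate one bad document that throws the exception from the log above.
        Map<String, String> report = run(docs, Set.of(), name -> {
            if (name.startsWith("empty")) {
                throw new IllegalArgumentException("fromKey > toKey");
            }
        });
        report.forEach((doc, status) -> System.out.println(doc + "\t" + status));
    }
}
```

In a hand-rolled loop the try/catch would sit around the per-document block, with any document clean-up in a finally clause so that a failed document is still released. Persisting the report to disk and passing the finished names back in as alreadyDone gives a crude version of the resume behaviour described above.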

Ian

On 10/06/2022 13:35, Lacey A.S. wrote:
> [...]


-- 
Ian Roberts               | Department of Computer Science
i.roberts@...  | University of Sheffield, UK