Prevent GATE from terminating when it reads in a letter with illegal characters / empty
Lacey A.S.
Hi - some of the .docx files we read in are parsed through Tika in GATE, and some of them contain illegal characters or are empty. GATE terminates mid-way through the corpus, and we were wondering if there is any way to prevent this behavior. We are processing tens of thousands of letters using BatchProcessApp.java - are there any tips on how to handle the exception other than terminating the program? Although there may be other reasons for the termination, we believe it is always due to illegal characters in the doc files. Here is a sample error:
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
An error occurred processing document 'some_doc.doc'. This was document 3 of 3 in the 'ds f' corpus. See the log for details
java.lang.IllegalArgumentException: fromKey > toKey
at gate.util.RBTreeMap$SubMap.<init>(RBTreeMap.java:833)
at gate.util.RBTreeMap.subMap(RBTreeMap.java:751)
at gate.annotation.AnnotationSetImpl.get(AnnotationSetImpl.java:640)
at gate.annotation.AnnotationSetImpl.get(AnnotationSetImpl.java:559)
at gate.context.ContextFeaturesTagger.assignContextFeatures(ContextFeaturesTagger.java:219)
at gate.context.ContextFeaturesTagger.execute(ContextFeaturesTagger.java:146)
at gate.util.Benchmark.executeWithBenchmarking(Benchmark.java:293)
at gate.creole.ConditionalSerialController.runComponent(ConditionalSerialController.java:172)
at gate.creole.SerialController.executeImpl(SerialController.java:158)
at gate.creole.ConditionalSerialAnalyserController.executeImpl(ConditionalSerialAnalyserController.java:225)
at gate.creole.ConditionalSerialAnalyserController.execute(ConditionalSerialAnalyserController.java:132)
at gate.util.Benchmark.executeWithBenchmarking(Benchmark.java:293)
at gate.gui.SerialControllerEditor$RunAction$1.run(SerialControllerEditor.java:1778)
at java.base/java.lang.Thread.run(Thread.java:832)
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Ian Roberts
The fix that involves the least change to
your current workflow would be to convert (or wrap) your pipeline
into a "realtime" controller, which has options to suppress
exceptions that occur during processing. But a more principled
answer is to ask whether you'd consider switching from
BatchProcessApp to GCP (the GATE Cloud Paralleliser).
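For anyone who stays with a BatchProcessApp-style loop, the general idea of suppressing per-document exceptions can be sketched in plain Java. This is not GATE API: `TolerantBatch`, `DocumentStep` and `runAll` are hypothetical names, and `DocumentStep` merely stands in for "load one document and run the pipeline on it". A minimal sketch under those assumptions:

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Sketch only: keep a batch running when a single document fails. */
final class TolerantBatch {

    /** Hypothetical stand-in for "load one document and run the pipeline on it". */
    interface DocumentStep {
        void process(String documentName) throws Exception;
    }

    /**
     * Runs the step on every document in order.
     * Returns a map of failed document name -> error description.
     */
    static Map<String, String> runAll(List<String> documentNames, DocumentStep step) {
        Map<String, String> failures = new LinkedHashMap<>();
        for (String name : documentNames) {
            try {
                step.process(name);
            } catch (Exception e) {
                // Record the failure and carry on with the next document
                // instead of letting the exception end the whole run.
                failures.put(name, e.toString());
            }
        }
        return failures;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList("a.docx", "bad.docx", "c.docx");
        Map<String, String> failures = runAll(docs, name -> {
            // Simulated pipeline: one document triggers the error from the post.
            if (name.startsWith("bad")) {
                throw new IllegalArgumentException("fromKey > toKey");
            }
        });
        failures.forEach((doc, error) -> System.out.println(doc + " failed: " + error));
    }
}
```

In a real loop the body of `process` would hold the existing load-and-execute code, and any cleanup of the document would need to happen whether or not the step threw.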
BatchProcessApp was written more as an
example of how you can use GATE Embedded from Java code than as a
serious tool for large-scale processing. GCP is the tool that is
designed for batch processing: it can handle more kinds of inputs
and outputs, and it handles failures like this more gracefully
(simply marking the document as "failed" in the final report
rather than crashing the whole Java process). In the event that a
GCP batch does crash, you can re-run it with the same parameters
and it will read the report file from the first run and continue
from where it left off.
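The report-and-resume behaviour described above can be illustrated with a small sketch. The tab-separated report format and the names `ResumableBatch`, `DocumentStep`, `alreadyRecorded` and `run` are assumptions made for illustration; this is not GCP's actual report format or code.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Sketch only: a batch that records each outcome and skips recorded documents on re-run. */
final class ResumableBatch {

    /** Hypothetical stand-in for "load one document and run the pipeline on it". */
    interface DocumentStep {
        void process(String documentName) throws Exception;
    }

    /** Reads the (hypothetical) report: one "name<TAB>status" line per document. */
    static Set<String> alreadyRecorded(Path report) {
        Set<String> done = new HashSet<>();
        if (!Files.exists(report)) {
            return done;
        }
        try {
            for (String line : Files.readAllLines(report, StandardCharsets.UTF_8)) {
                int tab = line.indexOf('\t');
                if (tab > 0) {
                    done.add(line.substring(0, tab));
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return done;
    }

    /** Processes only documents missing from the report; returns how many were attempted. */
    static int run(List<String> documentNames, DocumentStep step, Path report) {
        Set<String> done = alreadyRecorded(report);
        int attempted = 0;
        for (String name : documentNames) {
            if (done.contains(name)) {
                continue; // recorded by an earlier run, whether OK or FAILED
            }
            String status;
            try {
                step.process(name);
                status = "OK";
            } catch (Exception e) {
                status = "FAILED"; // mark the document and keep going
            }
            try {
                // Append immediately so the outcome survives a later crash.
                Files.write(report, Collections.singletonList(name + "\t" + status),
                        StandardCharsets.UTF_8,
                        StandardOpenOption.CREATE, StandardOpenOption.APPEND);
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
            attempted++;
        }
        return attempted;
    }

    public static void main(String[] args) throws IOException {
        Path report = Files.createTempFile("batch-report", ".tsv");
        List<String> docs = Arrays.asList("a.docx", "bad.docx", "c.docx");
        DocumentStep step = name -> {
            if (name.startsWith("bad")) {
                throw new IllegalArgumentException("fromKey > toKey");
            }
        };
        System.out.println("first run attempted:  " + run(docs, step, report));
        System.out.println("second run attempted: " + run(docs, step, report));
        Files.delete(report);
    }
}
```

Because each outcome is appended as soon as a document finishes, a document that was in flight when the process died is simply absent from the report and is retried on the next run.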
Ian
On 10/06/2022 13:35, Lacey A.S. wrote:
-- Ian Roberts | Department of Computer Science i.roberts@... | University of Sheffield, UK