pdf to text, text order problem #gate-embedded
bem85@...
Hello,
I really need your help. I'm trying to extract information from some CVs that are in PDF format. When I use GATE, the extracted text is not in the same order as the text in the PDF. I know the problem comes from the PDFs themselves, which were probably created from text blocks in a Word document and then saved as PDF. Is there any way to use GATE to extract the text in the same order in which it is displayed? I hope my question is clear and that you can help me. Thank you.
Jan Dedek
Hi,
I think I know the problem quite well. I think GATE (its PDF library) extracts the text exactly in the "cursor order", which is the order of the elements in the PDF file. You can see this order in a graphical PDF viewer such as Acrobat Reader: place the cursor for text selection and move it with the left or right arrow key. Sometimes this order is not the logical word order of the text. In PDFs that come from OCR especially, even sentences are broken up by the wrong text order.
The good thing is that PDF libraries let you obtain page coordinates for each letter. The bad thing is that GATE does not provide those coordinates with its standard libraries.
We tried to create our own PDF-to-text converter in Java and Python, but the results were far from perfect. Determining the correct word order is a difficult problem (consider document headers, footers, columns, floating images with captions, etc.), and our converter was usually no better than the original order from the PDF. Its output was different, but often not better.
I wish you good luck, and I hope my comments save you some work and time.
Cheers, JD
On Thu, 31 Mar 2022 at 17:23, <bem85@...> wrote: Hello,
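For readers who want to experiment with the coordinate-based reordering mentioned in the reply above, here is a minimal Python sketch. It is not GATE code and not the converter Jan's team wrote. It assumes you have already obtained text fragments with page coordinates from some PDF library; the `Fragment` class, the `reading_order` function and the `line_tolerance` parameter are hypothetical names made up for this illustration, and `y` is assumed to be measured from the top of the page.

```python
from dataclasses import dataclass


@dataclass
class Fragment:
    """Hypothetical text fragment with the top-left corner of its box."""
    text: str
    x: float  # distance from the left page edge
    y: float  # distance from the top page edge (assumption)


def reading_order(fragments, line_tolerance=3.0):
    """Return the text in naive top-to-bottom, left-to-right order.

    Fragments whose y values lie within line_tolerance of the first
    fragment of the current line are treated as the same visual line.
    """
    lines = []  # each entry: [reference_y, [fragments on that line]]
    for frag in sorted(fragments, key=lambda f: f.y):
        if lines and abs(frag.y - lines[-1][0]) <= line_tolerance:
            lines[-1][1].append(frag)
        else:
            lines.append([frag.y, [frag]])
    return "\n".join(
        " ".join(f.text for f in sorted(line, key=lambda f: f.x))
        for _, line in lines
    )


# Fragments listed in a scrambled "cursor order", as a PDF might store them.
fragments = [
    Fragment("Python", 120, 50.5),
    Fragment("Skills:", 40, 50),
    Fragment("Curriculum", 40, 10),
    Fragment("Vitae", 130, 10.4),
]
print(reading_order(fragments))
```

This simple sort only handles a single-column page. As the reply points out, headers, footers, multiple columns and floating captions defeat it, so treat it as a starting point for experiments and not as a solution.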
bem85@...
Hi Jan Dedek,
Yes, it's exactly the "cursor order" you described. I'm sad to learn that the GATE libraries don't already offer a solution, because I don't have much time to build something from scratch. I'll keep searching for another solution to this problem, and if someone else can help, that would be great! Thank you.