Onto Root Gazetteer


bluesunny
 

Hi,
My project involves developing an ontology and using it to annotate social media posts.
I crawled several hundred social media posts into a single CSV file and annotated them using the OntoRoot Gazetteer.
I want to analyze each post (each row) individually, but GATE Developer seems to treat the CSV file as a single document.
Is there any way to separate the rows of the CSV file?
Do I have to load every post as a language resource (LR) individually? (That would be very time consuming.)
Please help me with this problem.
Thank you!

Sincerely,
YR


Mark Greenwood
 

I'm not sure how you loaded the CSV file into GATE but if you use the CSV populator from the "Format: CSV" plugin, then you can specify that each row should be used to create a separate document. You can find full details of how to do this in the manual: https://gate.ac.uk/userguide/sec:creole:csv

Hope that helps,

Mark



bluesunny
 

Dear Greenwood,

Thanks a lot for your reply.
Actually, I converted the CSV file to XLSX format, loaded that file as a language resource (LR), and made a corpus from it.
Each row (individual post) is distinguished like this:

image.png

Is it necessary to use the 'Format: CSV' plugin if I want to turn each row into a separate document?

Sincerely yours,
YR




Mark Greenwood
 

Yes, if you create a new document (i.e. right click on language resources and choose to make a new document) then you will always end up with a single document. If you want to take a single file and produce multiple documents then you need to first create a corpus, and then use a populator of some form (populators usually turn up on the right click menu of the corpus). The only one I know of that will work with the data you have is the CSV populator, as it specifically has an option for creating one document per row.
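
To make the "one document per row" idea concrete, here is a minimal Python sketch that works outside GATE. It is not the CSV populator's implementation and does not use GATE's API; the function name `rows_to_documents` and the `text_column` parameter are hypothetical names chosen only for illustration.

```python
import csv
import io


def rows_to_documents(csv_text: str, text_column: int = 0) -> list[str]:
    """Return the text of the chosen column, one entry per CSV row.

    Each returned string plays the role of one "document".
    Illustration only: this is not how GATE implements its populator.
    """
    reader = csv.reader(io.StringIO(csv_text, newline=""))
    # Skip completely empty lines, which csv.reader yields as empty lists.
    return [row[text_column] for row in reader if row]


# Example: two rows, with the post text in the second column.
sample = 'id1,"first post, with a comma"\nid2,second post\n'
documents = rows_to_documents(sample, text_column=1)
```

Each element of `documents` corresponds to one row of the file, which is the behaviour the populator option provides inside GATE.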

Mark



bluesunny
 

Hello Greenwood,
As you suggested, I tried 'Populate from CSV file' on my corpus and got the error message below.
image.png

image.png

So I changed the 'Quote Character' from " to ' and succeeded in converting the rows of my CSV file into separate documents, but some of the text after the comma is missing, like this:
image.png

image.png

This is the whole text of that row in my CSV file. As you can see, the content after the first comma is missing.
image.png

Can you suggest a solution? I found that all the text rows in the CSV file end with ", but I have no idea how to solve this problem.
Thank you very much for your help.

Sincerely yours,
 



Ian Roberts
 

I can't tell for sure but from the look of those screenshots it appears that the lines are not 'ended with " ', but in fact the closing double quote that is supposed to terminate the quoted string values has at some point been converted into a "curly quote" or "smart quote" instead of a true double quote character.  This means that the CSV parser does not see the character as a proper terminator for the quoted string, and thus complains that the quoted field has not been properly terminated.  If you convert the curly ” back into a normal " then it should load properly.
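
One possible way to do that conversion is a small script run on the file before loading it into GATE. The following is a sketch only: the helper names `straighten_quotes` and `straighten_file` are hypothetical, and it assumes the file is UTF-8 and that curly double quotes appear only where a straight quote was intended.

```python
# Map typographic ("smart") double quotes to the plain ASCII double quote.
CURLY_TO_STRAIGHT = str.maketrans({"\u201c": '"', "\u201d": '"'})


def straighten_quotes(text: str) -> str:
    """Return text with curly double quotes replaced by straight ones."""
    return text.translate(CURLY_TO_STRAIGHT)


def straighten_file(src_path: str, dst_path: str, encoding: str = "utf-8") -> None:
    """Write a copy of src_path with curly double quotes straightened.

    newline="" keeps the original line endings untouched.
    """
    with open(src_path, encoding=encoding, newline="") as src:
        cleaned = straighten_quotes(src.read())
    with open(dst_path, "w", encoding=encoding, newline="") as dst:
        dst.write(cleaned)
```

If curly quotes also occur legitimately inside the post text, a blanket replacement like this would introduce unescaped quote characters inside quoted values, so the result should be checked before importing.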

"CSV" is a specific file format with rules that the data must follow:
  • Values are separated by the "column separator" (by default a comma character)
  • The "quote character" (by default double quote) must be placed around any value that contains the column separator, the quote character itself, or a line break.  Items that do not contain any of these may also be quoted but do not have to be.
  • Quote characters within a quoted string must be doubled (i.e. this is a "quoted" example -> "this is a ""quoted"" example")
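
These rules can be seen in action with Python's standard `csv` module, used here purely as an illustration of the format (it is not the parser GATE uses). The sample rows are made up for the example.

```python
import csv
import io

rows = [
    ["1", "plain text"],                  # no special characters: left unquoted
    ["2", "has, a comma"],                # contains the separator: must be quoted
    ["3", 'this is a "quoted" example'],  # contains quotes: quoted and doubled
]

# Write the rows using the default dialect (comma separator, double-quote
# quote character, quoting applied only where needed).
buffer = io.StringIO()
csv.writer(buffer, lineterminator="\n").writerows(rows)
encoded = buffer.getvalue()

# Reading the text back recovers the original values exactly.
decoded = list(csv.reader(io.StringIO(encoded)))
```

Here `encoded` contains the three lines `1,plain text`, `2,"has, a comma"` and `3,"this is a ""quoted"" example"`, and `decoded` equals `rows`.
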
If what you have is not really a "CSV file" that follows these quoting rules but just a text file with one item per line and no line breaks within a single item, then you should still be able to import it using the CSV populator with a little trick.  You would need to change the column separator and quote character to some obscure Unicode characters that are guaranteed not to appear anywhere in any of the actual values, such as \uE100 and \uE101 (a couple of random characters I've pulled from the "private use" area of the Unicode table - the column separator and quote character boxes accept these \uNNNN escape sequences).  That way the CSV reader will see each line of the file as a single "column" and not get confused by mismatched quotes.
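
The effect of this trick can be sketched with Python's `csv` module standing in for the CSV reader. This only illustrates the idea and says nothing about GATE's internals; the sample lines are invented, with stray commas and mismatched quotes on purpose.

```python
import csv
import io

# One item per line; commas and unbalanced quotes would confuse a normal
# CSV parse with the default separator and quote character.
raw = 'Great phone, love it\u201d\nShe said "wow", twice\n'

# Use private-use characters that never occur in the data as the column
# separator and quote character, so every line is read as a single column.
reader = csv.reader(io.StringIO(raw), delimiter="\uE100", quotechar="\uE101")
items = [row[0] for row in reader if row]
```

Each line comes back whole as one value, commas and quotes included, which is what the same settings are meant to achieve in the populator dialog.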

Ian



-- 
Ian Roberts               | Department of Computer Science
i.roberts@...  | University of Sheffield, UK