Exporting GATE XML file to a DataBase (Neo4j)


Alexandre Dniestrowski <axd@...>
 

Hello everybody!
 
I am continuing my investigations on how to explore/navigate GATE's annotations.

After saving the GATE processing results to an XML file, I am trying to import this file into a Neo4j graph database.

=> Does anyone in the community have similar experience importing XML files into a database, particularly with 'mapping' issues?
 
Thanks for your attention - feedback will be highly appreciated!
 
Cordially,
Alexandre


David Conlan
 

What is it you are trying to do?

I've fed GATE output into databases and into Elasticsearch, but I tend to work in Java directly with the annotations.
The problem with working with the full XML is that it often contains a lot of annotations I don't need or care about in the final output.
Our pipeline is basically:
load gate program
for each file to process
  process file in gate
  extract just annotations of interest
  send to db
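
The loop above can be sketched in plain Java. This is only an illustration under stated assumptions: `Ann`, `ExtractionPipeline`, the `annotator` function (standing in for running a file through GATE) and the `sink` (standing in for the database write) are hypothetical names, not part of GATE or of the code described in this thread.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.function.Consumer;
import java.util.function.Function;

// Hypothetical minimal annotation: a type plus character offsets.
record Ann(String type, int start, int end) {}

class ExtractionPipeline {
    private final Function<String, List<Ann>> annotator; // stand-in for "process file in gate"
    private final Set<String> typesOfInterest;           // annotation types to keep
    private final Consumer<List<Ann>> sink;              // stand-in for "send to db"

    ExtractionPipeline(Function<String, List<Ann>> annotator,
                       Set<String> typesOfInterest,
                       Consumer<List<Ann>> sink) {
        this.annotator = annotator;
        this.typesOfInterest = typesOfInterest;
        this.sink = sink;
    }

    // For each file: annotate, keep only the annotations of interest, hand them to the sink.
    void run(List<String> files) {
        for (String file : files) {
            List<Ann> kept = new ArrayList<>();
            for (Ann ann : annotator.apply(file)) {
                if (typesOfInterest.contains(ann.type())) {
                    kept.add(ann);
                }
            }
            sink.accept(kept);
        }
    }
}
```

In a real setup the annotator would wrap a loaded GATE application and the sink a database client; filtering before the write is what keeps unwanted annotations out of the database.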


On Mon, Feb 7, 2022 at 10:27 PM Alexandre Dniestrowski via groups.io <axd=laposte.net@groups.io> wrote:


sankar0453@...
 

Hi Alexandre, just use GATE Embedded with Spring to extract data from files.

A full example can be found in the GATE training course materials:
https://gate.ac.uk/sale/talks/gate-course-jun18/module-8-developers/3-advanced-embedded/3-advanced-embedded-hands-on.zip


Alexandre Dniestrowski <axd@...>
 

Many thanks Sankar - I just noticed your feedback and will look into your suggestion. I had to postpone my work on XML export/import until this summer, due to other assignments.

Cordially,
Alexandre


Alexandre Dniestrowski <axd@...>
 

Many thanks David - I am sorry, I just noticed your feedback.

You mention "extract just annotations of interest" - could you please elaborate on how you achieve this selective extraction?

Cordially,
Alexandre


David Conlan
 

Alexandre,

The way my code works is that I define an array of strings which represent the data I want to extract.
They take the form <AnnSetName>:<AnnType>:<feature>. I allow '*' to match all, and 'DEFAULT' is used for the default annotation set.
So "DEFAULT:Token:*" would map to all Token annotations in the default set and include all their features.
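
A minimal sketch of how such selector strings could be parsed and matched is shown here. The `Selector` record and its method names are hypothetical, and modelling the unnamed default annotation set as an empty string is an assumption made for this example; the actual code is not shown in this thread.

```java
// Hypothetical parsed form of one "<AnnSetName>:<AnnType>:<feature>" selector.
record Selector(String setName, String type, String feature) {

    static Selector parse(String spec) {
        // Limit 3 so that a feature name may itself contain ':'.
        String[] parts = spec.split(":", 3);
        if (parts.length != 3) {
            throw new IllegalArgumentException(
                "Expected <AnnSetName>:<AnnType>:<feature>, got: " + spec);
        }
        // Assumption: the default annotation set is modelled as the empty name.
        String set = parts[0].equals("DEFAULT") ? "" : parts[0];
        return new Selector(set, parts[1], parts[2]);
    }

    // '*' matches anything; otherwise the names must be equal.
    private static boolean matches(String pattern, String value) {
        return pattern.equals("*") || pattern.equals(value);
    }

    // A null set name is treated as the default set.
    boolean matchesAnnotation(String annSetName, String annType) {
        String set = annSetName == null ? "" : annSetName;
        return matches(setName, set) && matches(type, annType);
    }

    boolean matchesFeature(String featureName) {
        return matches(feature, featureName);
    }
}
```

With this sketch, `Selector.parse("DEFAULT:Token:*")` accepts Token annotations from the default set together with every feature, while `Selector.parse("Medtex:*:Value")` accepts any annotation type in the Medtex set but only the Value feature.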

I then have a simplified POJO model which represents the annotations, and I use the Jackson library to convert that to JSON format.

The POJOs look like the classes below; feature values are forced into strings.


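// Each public class goes in its own .java file.
import java.util.List;

// One feature of an annotation; the value is always stored as a string.
public class SummaryFeature {
    String name;
    String value;
}

// One annotation: its type, character offsets and features.
public class SummaryAnnotation {
    String type;
    Integer start;
    Integer end;
    List<SummaryFeature> features;
}

// A named annotation set and the annotations selected from it.
public class SummaryAnnotationSet {
    String name;
    List<SummaryAnnotation> annotations;
}

// Top-level object serialized to JSON for each processed document.
public class SummaryDocument {
    String id;
    List<SummaryAnnotationSet> annotationSets;
}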

The resulting JSON looks like:

{
  "id": "file:test.txt",
  "annotationSets": [
    {
      "Annotations": [
        {
          "Type": "Notifiable",
          "Start": 0,
          "End": 5115,
          "Features": [
            {
              "Field": "Value",
              "Value": "Histotype Notifiable"
            }
          ]
        },
        {
          "Type": "HistologicalGrade",
          "Start": 1762,
          "End": 1788,
          "Features": [
            {
              "Field": "Section",
              "Value": "MICROSCOPIC"
            },
            {
              "Field": "Value",
              "Value": "2"
            }
          ]
        },
        {
          "Type": "HistologicalType",
          "Start": 1789,
          "End": 1803,
          "Features": [
            {
              "Field": "Section",
              "Value": "MICROSCOPIC"
            },
            {
              "Field": "Value",
              "Value": "M-81403"
            }
          ]
        }
      ],
      "name": "Medtex"
    }
  ]
}
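
For the Neo4j mapping asked about at the start of this thread, one possible (untested against a live database) way to turn a single annotation from this structure into a parameterized Cypher statement is sketched here. The node labels, relationship types, property names and the `CypherMapping` class are all hypothetical choices made for illustration; nothing in this thread prescribes a particular graph model.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class CypherMapping {
    // Hypothetical graph model: Document -> AnnotationSet -> Annotation.
    // Features are copied onto the annotation node as string properties.
    static final String STATEMENT =
          "MERGE (d:Document {id: $docId}) "
        + "MERGE (d)-[:HAS_SET]->(s:AnnotationSet {name: $setName}) "
        + "CREATE (s)-[:HAS_ANNOTATION]->"
        + "(a:Annotation {type: $type, startOffset: $start, endOffset: $end}) "
        + "SET a += $features";

    // Builds the parameter map that goes with STATEMENT for one annotation.
    static Map<String, Object> params(String docId, String setName, String type,
                                      int start, int end,
                                      Map<String, String> features) {
        Map<String, Object> p = new LinkedHashMap<>();
        p.put("docId", docId);
        p.put("setName", setName);
        p.put("type", type);
        p.put("start", start);
        p.put("end", end);
        p.put("features", features);
        return p;
    }
}
```

The statement and the parameter map would then be handed to a Neo4j client, for example through the Neo4j Java driver's `run` method that takes a query string and a parameter map. A feature named `type`, `startOffset` or `endOffset` would overwrite the fixed properties in this model, so a real mapping might prefix feature names or store features as separate nodes.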

Does that help?



Alexandre Dniestrowski <axd@...>
 

Good evening David!

Many thanks for your explanations - I feel that I guess more than I fully understand.

=> What is the "workflow context" of your code?
1- Is it a GATE plugin (written in Java) that you plug into your processing pipeline?
2- Do you use JAPE grammars for the processing?
3- Is the input the GATE annotation database or another resource?
4- Is the output JSON file produced by your code?

I hope these questions are not too naive, but I am still learning how to add processing resources at the GATE level and I have a looong road in front of me...

Thank you very much for your attention; your feedback will be of great value.

Cordially,
Alexandre


David Conlan
 

The system is used for processing pathology reports for cancer staging.

The architecture has a few moving parts. We have a message queue that is fed pathology reports. The GATE processors take reports off this queue, run the GATE pipeline and put the JSON result onto another message queue. Finally, another job reads the JSON results off that queue and updates the database.
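
The two-queue layout can be illustrated with in-memory queues. This is only a sketch: `QueueStage`, `drain` and `runOnce` are hypothetical names, the lambda that produces a JSON string stands in for the GATE pipeline, and the final queue stands in for the database. A real deployment would use a message broker and separate, long-running worker processes.

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.function.Function;

class QueueStage {
    // Moves every message currently on `in` through `step` and onto `out`;
    // returns how many messages were handled.
    static <I, O> int drain(Queue<I> in, Function<I, O> step, Queue<O> out) {
        int handled = 0;
        I message;
        while ((message = in.poll()) != null) {
            out.add(step.apply(message));
            handled++;
        }
        return handled;
    }

    // Two stages chained as described: reports -> JSON results -> "database".
    static Queue<String> runOnce(Queue<String> reports) {
        Queue<String> results = new ConcurrentLinkedQueue<>();
        Queue<String> database = new ConcurrentLinkedQueue<>();
        // Stand-in for the GATE pipeline: emits a tiny JSON string per report.
        drain(reports, r -> "{\"length\": " + r.length() + "}", results);
        // Stand-in for the job that updates the database.
        drain(results, json -> json, database);
        return database;
    }
}
```

Keeping the stages behind queues means the GATE workers and the database job can be scaled or restarted independently of each other.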

The GATE application that is run has maybe 15 steps, a few of which are JAPE. Custom plugins talk to a MetaMap server for extraction of UMLS concepts and to a terminology server for SNOMED CT queries, and one step runs an ML fastText model trained to classify sentences.




Alexandre Dniestrowski <axd@...>
 

Hello Dave!

Thanks for the explanations - I can better understand your code now.

Wish you a great (end of) week.

Cordially,
Alexandre