The Dr. Inventor Text Mining Framework consists of a pipeline of text mining modules, each specialized in a specific analysis of the content of scientific publications. The following picture provides an overview of the sequence of text mining modules used to process papers in both PDF and JATS XML format.

Architectural overview of the Dr. Inventor Text Mining Framework

  • PDF to text converter: the conversion of PDF papers to structured textual documents can be performed by relying on PDFX or GROBID. More info at: Converting PDF to text.
  • Inline citation spotter: inline citations are identified inside the paper and linked to the corresponding bibliographic entries. More info at: Analyzing citations.
  • Sentence splitter: sentences are identified inside a paper by means of a set of rules customized to scientific publications.
  • Web-based reference parser: bibliographic entries are parsed and enriched with external metadata by querying Bibsonomy, CrossRef and FreeCite. More info at: Analyzing citations.
  • Citation-aware dependency parser: the dependency tree is extracted from each sentence of the paper by taking into account inline citations with a syntactic role. More info at: Generating Subject-Verb-Object graphs.
  • Rhetorical sentence classifier: the rhetorical class of each sentence (Challenge, Background, Approach, Outcome and Future Work) is automatically determined. More info at: Identifying the rhetorical role of sentences.
  • Babelfy WSD and Entity Linker: the Babelfy Word Sense Disambiguation and Entity Linking algorithm is applied to the paper by spotting Babelnet concepts. More info at: Annotating papers by Babelnet.
  • Coreference resolver, causality spotter and graph builder: the subject-verb-object graph of the contents of the paper is built. More info at: Generating Subject-Verb-Object graphs.
  • Extractive summarizer: an extractive summary of the paper is generated by selecting and ordering the most relevant sentences. Both the summarization approach and the number of sentences to include in the summary can be manually defined. More info at: Summarizing papers.
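
The sequence of modules listed above can be pictured as an ordered chain of stages, each one receiving the paper and adding its own annotations before handing it on. The following Python snippet is only a minimal sketch of that idea and does not reflect the actual API of the Dr. Inventor Framework: the `Paper` class, the `make_stage` and `run_pipeline` helpers and all stage names are hypothetical labels introduced for illustration, and each stage is a placeholder that merely records that it ran.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Paper:
    """Hypothetical container for a paper and the annotations added to it."""
    text: str
    annotations: Dict[str, object] = field(default_factory=dict)
    trace: List[str] = field(default_factory=list)  # names of the stages that ran


# A stage takes a paper and returns it, possibly enriched with annotations.
Stage = Callable[[Paper], Paper]


def make_stage(name: str) -> Stage:
    """Build a placeholder stage that only records its name in the trace."""
    def run(paper: Paper) -> Paper:
        paper.trace.append(name)
        return paper
    return run


# Illustrative labels only, in the same order as the module list above.
STAGE_NAMES = [
    "pdf_to_text",
    "inline_citation_spotter",
    "sentence_splitter",
    "reference_parser",
    "dependency_parser",
    "rhetorical_classifier",
    "babelfy_linker",
    "graph_builder",
    "summarizer",
]


def run_pipeline(paper: Paper, stages: List[Stage]) -> Paper:
    """Apply each stage in order, passing the paper from one to the next."""
    for stage in stages:
        paper = stage(paper)
    return paper


PIPELINE = [make_stage(name) for name in STAGE_NAMES]
result = run_pipeline(Paper(text="Placeholder paper content."), PIPELINE)
print(result.trace)
```

The ordering matters in this sketch for the same reason it matters in the list: later stages depend on the output of earlier ones, for example the dependency parser takes the spotted inline citations into account.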

The different scientific text analyses performed by the text mining modules of the Dr. Inventor Framework are described in greater detail at: