Descriptor
Documents available in this category (12)



A benchmark of nested named entity recognition approaches in historical structured documents / Solenn Tual (2023)
Title: A benchmark of nested named entity recognition approaches in historical structured documents
Document type: Article/Communication
Authors: Solenn Tual, Author; Nathalie Abadie, Author; Joseph Chazalon, Author; Bertrand Duménieu, Author; Edwin Carlinet, Author
Publisher: Champs-sur-Marne [France]: Université Gustave Eiffel
Year of publication: 2023
Projects: SODUCO / Perret, Julien
Extent: 18 p.
Format: 21 x 30 cm
General note: Bibliography
Languages: English (eng)
Descriptor: [IGN subject headings] Geomatics
[IGN terms] natural language (computing)
[IGN terms] name recognition
[IGN terms] natural language processing
Abstract: (Author) Named Entity Recognition (NER) is a key step in the creation of structured data from digitised historical documents. Traditional NER approaches deal with flat named entities, whereas entities often are nested. For example, a postal address might contain a street name and a number. This work compares three nested NER approaches, including two state-of-the-art approaches using Transformer-based architectures. We introduce a new Transformer-based approach based on joint labelling and semantic weighting of errors, evaluated on a collection of 19th-century Paris trade directories. We evaluate approaches regarding the impact of supervised fine-tuning, unsupervised pre-training with noisy texts, and variation of IOB tagging formats. Our results show that while nested NER approaches enable extracting structured data directly, they do not benefit from the extra knowledge provided during training and reach a performance similar to the base approach on flat entities. Even though all 3 approaches perform well in terms of F1 scores, joint labelling is most suitable for hierarchically structured data. Finally, our experiments reveal the superiority of the IO tagging format on such data.
Record number: P2023-001
Author affiliation: UGE-LASTIG+Ext (2020- )
Theme: GEOMATIQUE/TOPONYMIE
Nature: Preprint
nature-HAL: Préprint
DOI: none
Online publication date: 20/02/2023
Online: https://hal.science/hal-03994759v1/document
Electronic resource format: URL Article
Permalink: https://documentation.ensg.eu/index.php?lvl=notice_display&id=102602

Entry separation using a mixed visual and textual language model: Application to 19th century French trade directories / Bertrand Duménieu (2023)
Title: Entry separation using a mixed visual and textual language model: Application to 19th century French trade directories
Document type: Article/Communication
Authors: Bertrand Duménieu, Author; Edwin Carlinet, Author; Nathalie Abadie, Author; Joseph Chazalon, Author
Publisher: Champs-sur-Marne [France]: Université Gustave Eiffel
Year of publication: 2023
Projects: SODUCO / Perret, Julien
Extent: 20 p.
Format: 21 x 30 cm
General note: Bibliography
Languages: English (eng)
Descriptor: [IGN subject headings] Geomatics
[IGN terms] directory
[IGN terms] nineteenth century
[IGN terms] language model
[IGN terms] name recognition
Abstract: (Author) When extracting structured data from repetitively organized documents, such as dictionaries, directories, or even newspapers, a key challenge is to correctly segment what constitutes the basic text regions for the target database. Traditionally, such a problem was tackled as part of the layout analysis and was mostly based on visual clues for dividing (top-down) approaches. Some agglomerating (bottom-up) approaches started to consider textual information to link similar contents, but they required a proper over-segmentation of fine-grained units. In this work, we propose a new pragmatic approach whose efficiency is demonstrated on 19th century French Trade Directories. We propose to consider two sub-problems: coarse layout detection (text columns and reading order), which is assumed to be effective and not detailed here, and a fine-grained entry separation stage for which we propose to adapt a state-of-the-art Named Entity Recognition (NER) approach. By injecting special visual tokens, coding, for instance, indentation or breaks, into the token stream of the language model used for NER purpose, we can leverage both textual and visual knowledge simultaneously. Code, data, results and models are available at https://github.com/soduco/paper-entryseg-icdar23-code, https://huggingface.co/HueyNemud/ (icdar23-entrydetector* variants).
Record number: P2023-002
Author affiliation: UGE-LASTIG+Ext (2020- )
Theme: GEOMATIQUE/INFORMATIQUE/TOPONYMIE
Nature: Preprint
nature-HAL: Préprint
DOI: none
Online publication date: 17/02/2023
Online: https://hal.science/hal-03994702v1/
Electronic resource format: URL Article
Permalink: https://documentation.ensg.eu/index.php?lvl=notice_display&id=102609

Geographic named entity recognition by employing natural language processing and an improved BERT model / Liufeng Tao in ISPRS International journal of geo-information, vol 11 n° 12 (December 2022)
[article]
Title: Geographic named entity recognition by employing natural language processing and an improved BERT model
Document type: Article/Communication
Authors: Liufeng Tao, Author; Zhong Xie, Author; Dexin Xu, Author; et al., Author
Year of publication: 2022
Article on page(s): n° 598
General note: Bibliography
Languages: English (eng)
Descriptor: [IGN subject headings] Geomatics
[IGN terms] China
[IGN terms] supervised classification
[IGN terms] recurrent neural network classification
[IGN terms] social media data
[IGN terms] public data
[IGN terms] dataset
[IGN terms] character recognition
[IGN terms] name recognition
[IGN terms] performance test
[IGN terms] toponym
[IGN terms] natural language processing
Abstract: (author) Toponym recognition, or the challenge of detecting place names that have a similar referent, is involved in a number of activities connected to geographical information retrieval and geographical information sciences. This research focuses on recognizing Chinese toponyms from social media communications. While broad named entity recognition methods are frequently used to locate places, their accuracy is hampered by the many linguistic abnormalities seen in social media posts, such as informal sentence constructions, name abbreviations, and misspellings. In this study, we describe a Chinese toponym identification model based on a hybrid neural network that was created with these linguistic inconsistencies in mind. Our method adds a number of improvements to a standard bidirectional recurrent neural network model to help with location detection in social media messages. We demonstrate the results of a wide-ranging evaluation of the performance of different supervised machine learning methods, which have the natural advantage of avoiding human design features. A set of controlled experiments with four test datasets (one constructed and three public datasets) demonstrates the performance of supervised machine learning that can achieve good results on the task, significantly outperforming seven baseline models.
Record number: A2022
Author affiliation: non-IGN
Theme: GEOMATIQUE
Nature: Article
DOI: 10.3390/ijgi11120598
Online publication date: 28/11/2022
Online: https://doi.org/10.3390/ijgi11120598
Electronic resource format: URL article
Permalink: https://documentation.ensg.eu/index.php?lvl=notice_display&id=102178
in ISPRS International journal of geo-information > vol 11 n° 12 (December 2022). - n° 598 [article]

A benchmark of named entity recognition approaches in historical documents : application to 19th century French directories / Nathalie Abadie (2022)
Title: A benchmark of named entity recognition approaches in historical documents : application to 19th century French directories
Document type: Article/Communication
Authors: Nathalie Abadie, Author; Edwin Carlinet, Author; Joseph Chazalon, Author; Bertrand Duménieu, Author
Publisher: Berlin, Heidelberg, Vienna, New York, ...: Springer
Year of publication: 2022
Series: Lecture notes in Computer Science, ISSN 0302-9743, no. 13237
Projects: SODUCO / Perret, Julien
Conference: DAS 2022, 5th IAPR International Workshop on Document Analysis Systems, 22/05/2022 to 25/05/2022, La Rochelle, France, Proceedings Springer
Extent: pp 445 - 460
General note: Bibliography
Languages: English (eng)
Descriptor: [IGN subject headings] Geomatics
[IGN terms] convolutional neural network classification
[IGN terms] nineteenth century
[IGN terms] training data (machine learning)
[IGN terms] text mining
[IGN terms] geohistorical object
[IGN terms] name recognition
[IGN terms] natural language processing
Abstract: (author) Named entity recognition (NER) is a necessary step in many pipelines targeting historical documents. Indeed, such natural language processing techniques identify which class each text token belongs to, e.g. “person name”, “location”, “number”. Introducing a new public dataset built from 19th century French directories, we first assess how noisy modern, off-the-shelf OCR are. Then, we compare modern CNN- and Transformer-based NER techniques which can be reasonably used in the context of historical document analysis. We measure their requirements in terms of training data, the effects of OCR noise on their performance, and show how Transformer-based NER can benefit from unsupervised pre-training and supervised fine-tuning on noisy data. Results can be reproduced using resources available at https://github.com/soduco/paper-ner-bench-das22 and https://zenodo.org/record/6394464.
Record number: C2022-030
Author affiliation: UGE-LASTIG+Ext (2020- )
Other associated URL: to HAL
Theme: GEOMATIQUE/INFORMATIQUE
Nature: Communication
nature-HAL: ComAvecCL&ActesPubliésIntl
DOI: 10.1007/978-3-031-06555-2_30
Online: http://dx.doi.org/10.1007/978-3-031-06555-2_30
Electronic resource format: URL article
Permalink: https://documentation.ensg.eu/index.php?lvl=notice_display&id=101088

NeuroTPR: A neuro‐net toponym recognition model for extracting locations from social media messages / Jimin Wang in Transactions in GIS, Vol 24 n° 3 (June 2020)
[article]
Title: NeuroTPR: A neuro‐net toponym recognition model for extracting locations from social media messages
Document type: Article/Communication
Authors: Jimin Wang, Author; Yingjie Hu, Author; Kenneth Joseph, Author
Year of publication: 2020
Article on page(s): pp 719 - 735
General note: Bibliography
Languages: English (eng)
Descriptor: [IGN subject headings] Web geomatics
[IGN terms] natural disaster
[IGN terms] social media data
[IGN terms] volunteered geographic information
[IGN terms] workflow
[IGN terms] geolocation
[IGN terms] semantic accuracy
[IGN terms] name recognition
[IGN terms] recurrent neural network
[IGN terms] social network
[IGN terms] toponym
Abstract: (author) Social media messages, such as tweets, are frequently used by people during natural disasters to share real‐time information and to report incidents. Within these messages, geographic locations are often described. Accurate recognition and geolocation of these locations are critical for reaching those in need. This article focuses on the first part of this process, namely recognizing locations from social media messages. While general named entity recognition tools are often used to recognize locations, their performance is limited due to the various language irregularities associated with social media text, such as informal sentence structures, inconsistent letter cases, name abbreviations, and misspellings. We present NeuroTPR, which is a Neuro‐net ToPonym Recognition model designed specifically with these linguistic irregularities in mind. Our approach extends a general bidirectional recurrent neural network model with a number of features designed to address the task of location recognition in social media messages. We also propose an automatic workflow for generating annotated data sets from Wikipedia articles for training toponym recognition models. We demonstrate NeuroTPR by applying it to three test data sets, including a Twitter data set from Hurricane Harvey, and comparing its performance with those of six baseline models.
Record number: A2020-445
Author affiliation: non-IGN
Theme: GEOMATIQUE
Nature: Article
nature-HAL: ArtAvecCL-RevueIntern
DOI: 10.1111/tgis.12627
Online publication date: 14/05/2020
Online: https://doi.org/10.1111/tgis.12627
Electronic resource format: URL article
Permalink: https://documentation.ensg.eu/index.php?lvl=notice_display&id=95508
in Transactions in GIS > Vol 24 n° 3 (June 2020). - pp 719 - 735 [article]

Comparing supervised learning algorithms for Spatial Nominal Entity recognition / Amine Medad (2020)
Permalink
Mapping urban fingerprints of odonyms automatically extracted from French novels / Ludovic Moncla in International journal of geographical information science IJGIS, vol 33 n° 12 (December 2019)
Permalink
A natural language processing and geospatial clustering framework for harvesting local place names from geotagged housing advertisements / Yingjie Hu in International journal of geographical information science IJGIS, Vol 33 n° 3-4 (March - April 2019)
Permalink
GeoTxt: A scalable geoparsing system for unstructured text geolocation / Morteza Karimzadeh in Transactions in GIS, vol 23 n° 1 (February 2019)
Permalink
Repérage et identification automatiques de noms de lieux avec variations d'écriture dans des corpus / Mathilde Jouvel-Triollet (2019)
Permalink
Services web pour l’annotation sémantique d’information spatiale à partir de corpus textuels / Ludovic Moncla in Revue internationale de géomatique, vol 28 n° 4 (October - December 2018)
Permalink
Linking spatial named entities to the web of data for geographical analysis of historical texts / Pierre-Henri Paris in Journal of Map & Geography Libraries, vol 13 n° 1 ([01/05/2017])
Permalink