Authority details
SODUCO / Perret, Julien
Related authorities:
Name:
SODUCO
Full title:
Dynamiques Sociales en contexte urbain: outils, modèles et données libres, Paris et ses banlieues,
Project URL:
Authors:
Perret, Julien
Available documents (10)



A benchmark of nested named entity recognition approaches in historical structured documents / Solenn Tual (2023)
Title: A benchmark of nested named entity recognition approaches in historical structured documents
Document type: Article/Communication
Authors: Solenn Tual, Author; Nathalie Abadie, Author; Joseph Chazalon, Author; Bertrand Duménieu, Author; Edwin Carlinet, Author
Publisher: Champs-sur-Marne [France]: Université Gustave Eiffel
Publication year: 2023
Projects: SODUCO / Perret, Julien
Extent: 18 p.
Format: 21 x 30 cm
General note: Bibliography
Languages: English (eng)
Descriptors: [IGN subject headings] Geomatics
[IGN terms] natural language (computing)
[IGN terms] name recognition
[IGN terms] natural language processing
Abstract: (Author) Named Entity Recognition (NER) is a key step in the creation of structured data from digitised historical documents. Traditional NER approaches deal with flat named entities, whereas entities often are nested. For example, a postal address might contain a street name and a number. This work compares three nested NER approaches, including two state-of-the-art approaches using Transformer-based architectures. We introduce a new Transformer-based approach based on joint labelling and semantic weighting of errors, evaluated on a collection of 19th-century Paris trade directories. We evaluate approaches regarding the impact of supervised fine-tuning, unsupervised pre-training with noisy texts, and variation of IOB tagging formats. Our results show that while nested NER approaches enable extracting structured data directly, they do not benefit from the extra knowledge provided during training and reach a performance similar to the base approach on flat entities. Even though all 3 approaches perform well in terms of F1 scores, joint labelling is most suitable for hierarchically structured data. Finally, our experiments reveal the superiority of the IO tagging format on such data.
Record number: P2023-001
Author affiliation: UGE-LASTIG+Ext (2020- )
Theme: GEOMATICS/TOPONYMY
Nature: Preprint
HAL nature: Préprint
DOI: none
Online publication date: 20/02/2023
Online: https://hal.science/hal-03994759v1/document
Electronic resource format: URL Article
Permalink: https://documentation.ensg.eu/index.php?lvl=notice_display&id=102602

Entry separation using a mixed visual and textual language model: Application to 19th century French trade directories / Bertrand Duménieu (2023)
Title: Entry separation using a mixed visual and textual language model: Application to 19th century French trade directories
Document type: Article/Communication
Authors: Bertrand Duménieu, Author; Edwin Carlinet, Author; Nathalie Abadie, Author; Joseph Chazalon, Author
Publisher: Champs-sur-Marne [France]: Université Gustave Eiffel
Publication year: 2023
Projects: SODUCO / Perret, Julien
Extent: 20 p.
Format: 21 x 30 cm
General note: Bibliography
Languages: English (eng)
Descriptors: [IGN subject headings] Geomatics
[IGN terms] directory
[IGN terms] nineteenth century
[IGN terms] language model
[IGN terms] name recognition
Abstract: (Author) When extracting structured data from repetitively organized documents, such as dictionaries, directories, or even newspapers, a key challenge is to correctly segment what constitutes the basic text regions for the target database. Traditionally, such a problem was tackled as part of the layout analysis and was mostly based on visual clues for dividing (top-down) approaches. Some agglomerating (bottom-up) approaches started to consider textual information to link similar contents, but they required a proper over-segmentation of fine-grained units. In this work, we propose a new pragmatic approach whose efficiency is demonstrated on 19th century French Trade Directories. We propose to consider two sub-problems: coarse layout detection (text columns and reading order), which is assumed to be effective and not detailed here, and a fine-grained entry separation stage for which we propose to adapt a state-of-the-art Named Entity Recognition (NER) approach. By injecting special visual tokens, coding, for instance, indentation or breaks, into the token stream of the language model used for NER purpose, we can leverage both textual and visual knowledge simultaneously. Code, data, results and models are available at https://github.com/soduco/paper-entryseg-icdar23-code, https://huggingface.co/HueyNemud/ (icdar23-entrydetector* variants).
Record number: P2023-002
Author affiliation: UGE-LASTIG+Ext (2020- )
Theme: GEOMATICS/COMPUTER SCIENCE/TOPONYMY
Nature: Preprint
HAL nature: Préprint
DOI: none
Online publication date: 17/02/2023
Online: https://hal.science/hal-03994702v1/
Electronic resource format: URL Article
Permalink: https://documentation.ensg.eu/index.php?lvl=notice_display&id=102609

A benchmark of named entity recognition approaches in historical documents : application to 19th century French directories / Nathalie Abadie (2022)
Title: A benchmark of named entity recognition approaches in historical documents : application to 19th century French directories
Document type: Article/Communication
Authors: Nathalie Abadie, Author; Edwin Carlinet, Author; Joseph Chazalon, Author; Bertrand Duménieu, Author
Publisher: Berlin, Heidelberg, Vienna, New York, ...: Springer
Publication year: 2022
Collection: Lecture notes in Computer Science, ISSN 0302-9743, no. 13237
Projects: SODUCO / Perret, Julien
Conference: DAS 2022, 5th IAPR International Workshop on Document Analysis Systems, 22/05/2022 - 25/05/2022, La Rochelle, France, Proceedings Springer
Extent: pp 445 - 460
General note: bibliography
Languages: English (eng)
Descriptors: [IGN subject headings] Geomatics
[IGN terms] convolutional neural network classification
[IGN terms] nineteenth century
[IGN terms] training data (machine learning)
[IGN terms] text mining
[IGN terms] geohistorical object
[IGN terms] name recognition
[IGN terms] natural language processing
Abstract: (Author) Named entity recognition (NER) is a necessary step in many pipelines targeting historical documents. Indeed, such natural language processing techniques identify which class each text token belongs to, e.g. “person name”, “location”, “number”. Introducing a new public dataset built from 19th century French directories, we first assess how noisy modern, off-the-shelf OCR are. Then, we compare modern CNN- and Transformer-based NER techniques which can be reasonably used in the context of historical document analysis. We measure their requirements in terms of training data, the effects of OCR noise on their performance, and show how Transformer-based NER can benefit from unsupervised pre-training and supervised fine-tuning on noisy data. Results can be reproduced using resources available at https://github.com/soduco/paper-ner-bench-das22 and https://zenodo.org/record/6394464.
Record number: C2022-030
Author affiliation: UGE-LASTIG+Ext (2020- )
Other associated URL: to HAL
Theme: GEOMATICS/COMPUTER SCIENCE
Nature: Communication
HAL nature: ComAvecCL&ActesPubliésIntl
DOI: 10.1007/978-3-031-06555-2_30
Online: http://dx.doi.org/10.1007/978-3-031-06555-2_30
Electronic resource format: URL article
Permalink: https://documentation.ensg.eu/index.php?lvl=notice_display&id=101088

BuyTheDips : PathLoss for improved topology-preserving deep learning-based image segmentation / Minh On Vu Ngoc (2022)
Title: BuyTheDips : PathLoss for improved topology-preserving deep learning-based image segmentation
Document type: Article/Communication
Authors: Minh On Vu Ngoc, Author; Yizi Chen, Author; Nicolas Boutry, Author; Jonathan Fabrizio, Author; Clément Mallet, Author
Publisher: Ithaca [New York - United States]: arXiv - Cornell University
Publication year: 2022
Projects: SODUCO / Perret, Julien
Extent: 13 p.
General note: bibliography
Languages: English (eng)
Descriptors: [IGN subject headings] Optical image processing
[IGN terms] deep learning
[IGN terms] shortest path algorithm
[IGN terms] loss function
[IGN terms] digital image
[IGN terms] semantic proximity
[IGN terms] image segmentation
Abstract: (Author) Capturing the global topology of an image is essential for proposing an accurate segmentation of its domain. However, most existing segmentation methods do not preserve the initial topology of the given input, which is detrimental for numerous downstream object-based tasks. This is all the more true for deep learning models, which mostly work at local scales. In this paper, we propose a new topology-preserving deep image segmentation method which relies on a new leakage loss: the Pathloss. Our method is an extension of the BALoss [1], in which we want to improve the leakage detection for better recovering the closeness property of the image segmentation. This loss allows us to correctly localize and fix the critical points (a leakage in the boundaries) that could occur in the predictions, and is based on a shortest-path search algorithm. This way, loss minimization enforces connectivity only where it is necessary and finally provides a good localization of the boundaries of the objects in the image. Moreover, according to our research, our Pathloss learns to preserve stronger elongated structure compared to methods without using topology-preserving loss. Training with our topological loss function, our method outperforms state-of-the-art topology-aware methods on two representative datasets of different natures: Electron Microscopy and Historical Map.
Record number: P2022-005
Author affiliation: UGE-LASTIG+Ext (2020- )
Theme: IMAGERY/COMPUTER SCIENCE
Nature: Preprint
HAL nature: Préprint
DOI: 10.48550/arXiv.2207.11446
Online: https://doi.org/10.48550/arXiv.2207.11446
Electronic resource format: URL article
Permalink: https://documentation.ensg.eu/index.php?lvl=notice_display&id=101338

Combining deep learning and mathematical morphology for historical map segmentation / Yizi Chen (2021)
Title: Combining deep learning and mathematical morphology for historical map segmentation
Document type: Chapter/Contribution
Authors: Yizi Chen, Author; Edwin Carlinet, Author; Joseph Chazalon, Author; Clément Mallet, Author; Bertrand Duménieu, Author; Julien Perret, Author
Publisher: Berlin, Heidelberg, Vienna, New York, ...: Springer
Publication year: 2021
Collection: Lecture notes in Computer Science, ISSN 0302-9743, no. 12708
Projects: SODUCO / Perret, Julien
Conference: DGMM 2021, 1st International Joint Conference on Discrete Geometry and Mathematical Morphology, 24/05/2021 - 27/05/2021, Uppsala, Sweden, Proceedings Springer
Extent: pp 79 - 92
General note: bibliography
Languages: English (eng)
Descriptors: [IGN subject headings] Geomatics
[IGN terms] diachronic analysis
[IGN terms] deep learning
[IGN terms] historical map
[IGN terms] processing pipeline
[IGN terms] convolutional neural network classification
[IGN terms] object detection
[IGN terms] raster data
[IGN terms] mathematical morphology
[IGN terms] vectorization
Abstract: (Author) The digitization of historical maps enables the study of ancient, fragile, unique, and hardly accessible information sources. Main map features can be retrieved and tracked through the time for subsequent thematic analysis. The goal of this work is the vectorization step, i.e., the extraction of vector shapes of the objects of interest from raster images of maps. We are particularly interested in closed shape detection such as buildings, building blocks, gardens, rivers, etc. in order to monitor their temporal evolution. Historical map images present significant pattern recognition challenges. The extraction of closed shapes by using traditional Mathematical Morphology (MM) is highly challenging due to the overlapping of multiple map features and texts. Moreover, state-of-the-art Convolutional Neural Networks (CNN) are perfectly designed for content image filtering but provide no guarantee about closed shape detection. Also, the lack of textural and color information of historical maps makes it hard for CNN to detect shapes that are represented by only their boundaries. Our contribution is a pipeline that combines the strengths of CNN (efficient edge detection and filtering) and MM (guaranteed extraction of closed shapes) in order to achieve such a task. The evaluation of our approach on a public dataset shows its effectiveness for extracting the closed boundaries of objects in historical maps.
Record number: H2021-001
Author affiliation: UGE-LASTIG+Ext (2020- )
Other associated URL: to HAL
Theme: GEOMATICS
Nature: Chapter / contribution
HAL nature: ChOuvrScient
DOI: 10.1007/978-3-030-76657-3_5
Online publication date: 16/05/2021
Online: https://hal.science/hal-03101578v1
Electronic resource format: URL article
Permalink: https://documentation.ensg.eu/index.php?lvl=notice_display&id=96739

Vectorization of historical maps using deep edge filtering and closed shape extraction / Yizi Chen (2021)
Des empreintes cartographiques : restitution de données géohistoriques à partir de la Carte de France de Cassini, 1750-1789 / Bertrand Duménieu in Cartes & Géomatique, no. 241-242 (December 2019)
A hidden Markov model for matching spatial networks / Benoit Costes in Journal of Spatial Information Science (JoSIS), no. 18 (2019)
Engraved footprints from the past. Retrieving cartographic geohistorical data from the Cassini Carte de France, 1750-1789 / Bertrand Duménieu (2019)