Natural Language Processing (NLP) has seen remarkable advances in recent years, particularly with the emergence of Large Language Models that have achieved unprecedented performance across many tasks. However, these developments have mainly benefited a small number of high-resource languages such as English. The majority of languages still face significant challenges due to the scarcity of training data and computational resources. To address this issue, this thesis focuses on cross-lingual transfer learning, a research area aimed at leveraging data and models from high-resource languages to improve NLP performance for low-resource languages. Specifically, we focus on Sequence Labeling tasks such as Named Entity Recognition, Opinion Target Extraction, and Argument Mining.
The research is structured around three main objectives: (1) advancing data-based cross-lingual transfer learning methods through improved translation and annotation projection techniques, (2) developing enhanced model-based transfer learning approaches utilizing state-of-the-art multilingual models, and (3) applying these methods to real-world problems while creating open-source resources that facilitate future research in low-resource NLP.
More specifically, this thesis presents a new method to improve data-based transfer with T-Projection, a state-of-the-art annotation projection method that leverages text-to-text multilingual models and machine translation systems. T-Projection significantly outperforms previous annotation projection methods by a wide margin. For model-based transfer, we introduce a constrained decoding algorithm that enhances cross-lingual Sequence Labeling in zero-shot settings using text-to-text models. Finally, we develop Medical mT5, the first multilingual text-to-text medical model, demonstrating the practical impact of our research on real-world applications.
| Comments: | Doctoral Thesis: University of the Basque Country UPV/EHU |
| Subjects: | Computation and Language (cs.CL) |
| Cite as: | arXiv:2502.02722 [cs.CL] |
| | (or arXiv:2502.02722v1 [cs.CL] for this version) |
| | https://doi.org/10.48550/arXiv.2502.02722 |
Submission history
From: Iker García-Ferrero
[v1] Tue, 4 Feb 2025 21:17:46 UTC (25,942 KB)
Strengths:
Clear Context and Motivation:
The opening effectively situates the work within the broader field, highlighting recent advances with large language models and the disproportionate benefits afforded to high-resource languages like English. This sets a strong justification for focusing on low-resource languages.
Well-Defined Problem Statement:
The scarcity of data and computational resources for many languages is clearly identified as a key challenge. This makes the research goals understandable and relevant.
Specific Focus:
The emphasis on Sequence Labeling tasks—Named Entity Recognition, Opinion Target Extraction, and Argument Mining—provides concrete application areas, helping readers grasp the practical scope of the thesis.
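As a concrete anchor for readers less familiar with this task family, the toy Python snippet below (invented sentence and labels, not taken from the thesis) shows the standard framing of Sequence Labeling: every token receives a tag, here in the common BIO scheme used for Named Entity Recognition.

```python
# Toy Sequence Labeling example: each token gets a BIO tag (invented data).
tokens = ["Iker", "García", "lives", "in", "Donostia", "."]
tags   = ["B-PER", "I-PER", "O", "O", "B-LOC", "O"]

def extract_spans(tokens, tags):
    """Collect (entity_type, text) spans from a BIO-tagged token sequence."""
    spans, current, current_type = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                spans.append((current_type, " ".join(current)))
            current, current_type = [token], tag[2:]
        elif tag.startswith("I-") and current:
            current.append(token)
        else:
            if current:
                spans.append((current_type, " ".join(current)))
            current, current_type = [], None
    if current:
        spans.append((current_type, " ".join(current)))
    return spans

print(extract_spans(tokens, tags))  # [('PER', 'Iker García'), ('LOC', 'Donostia')]
```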
Organized Structure and Objectives:
The research goals are clearly structured into three main objectives: improving data-based methods, advancing model-based methods, and applying these in real-world contexts while producing open-source tools. This shows a comprehensive and systematic approach.
Novelty and Contributions:
The text outlines specific novel contributions:
T-Projection: a new annotation projection method leveraging text-to-text multilingual models and machine translation, with significantly improved performance (a conceptual sketch of annotation projection follows this list).
Constrained Decoding Algorithm: for zero-shot cross-lingual sequence labeling.
Medical mT5: the first multilingual medical text-to-text model, showing practical impact.
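To make the data-based transfer idea concrete, the sketch below illustrates generic annotation projection under simplifying assumptions: entity spans annotated on a source-language sentence are matched against candidate spans of a pre-supplied translation, and the best-scoring candidate inherits the label. The character-overlap scorer and the example sentence pair are stand-ins for illustration; T-Projection itself, per the abstract, scores candidates with multilingual text-to-text models and machine translation systems.

```python
# Minimal sketch of annotation projection (toy example, not T-Projection itself).
# A real system would score candidates with a multilingual text-to-text model;
# here a crude character-overlap score stands in for that step.
from difflib import SequenceMatcher

def candidate_spans(tokens, max_len=4):
    """All contiguous token spans of the target sentence up to max_len tokens."""
    for i in range(len(tokens)):
        for j in range(i + 1, min(i + 1 + max_len, len(tokens) + 1)):
            yield " ".join(tokens[i:j])

def project(source_span, target_sentence):
    """Pick the target-side candidate span that best matches the source span."""
    tokens = target_sentence.split()
    return max(candidate_spans(tokens),
               key=lambda cand: SequenceMatcher(None, source_span.lower(),
                                                cand.lower()).ratio())

# English source annotation and a Spanish translation (invented example).
source_entity = ("LOC", "New York")
target_sentence = "Ella vive en Nueva York desde 2019"
label, span = source_entity[0], project(source_entity[1], target_sentence)
print(label, "->", span)  # LOC -> Nueva York
```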
Practical Impact and Resources:
Highlighting the creation of open-source resources and real-world applications, especially in the medical domain, strengthens the thesis’s significance.
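As a usage-oriented note on the open-source resources, a released text-to-text checkpoint of this kind would typically be loadable through the Hugging Face transformers API roughly as sketched below; the model identifier and prompt are assumed placeholders, not confirmed details of the Medical mT5 release.

```python
# Hypothetical loading of an open-source Medical mT5 checkpoint via Hugging Face
# transformers; the model identifier and prompt below are assumed placeholders.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_id = "HiTZ/Medical-mT5-large"  # assumed identifier; check the actual release
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

prompt = "Extract the disease mentions: The patient was diagnosed with type 2 diabetes."
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```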
Areas for Improvement:
Quantitative Impact and Evaluation:
The abstract could benefit from mentioning some quantitative results or benchmarks that demonstrate the performance improvements (e.g., percentages, datasets used, comparative gains). This would substantiate claims like “significantly outperforms previous annotation projection methods.”
Clarify Methodological Details for Accessibility:
Terms like “T-Projection” and “constrained decoding algorithm” are introduced with good context but might still be abstract to some readers. A brief intuitive explanation of how these innovations work or why they are effective could improve understanding.
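By way of an intuitive (and purely illustrative) reading of the constrained decoding idea: when a text-to-text model generates labeled output for a sentence, each decoding step can be restricted to tokens copied from the input plus a small set of tag markers, so the zero-shot model cannot emit words that are absent from the source. The toy loop below assumes a random stand-in for the model's next-token scores and is not the algorithm from the thesis.

```python
# Toy illustration of constrained decoding for Sequence Labeling (not the thesis algorithm).
# At each step only words from the input sentence or tag markers may be generated.
import random

input_words = ["Obama", "visited", "Paris"]
tag_tokens = ["<per>", "</per>", "<loc>", "</loc>", "<eos>"]
allowed_vocab = set(input_words) | set(tag_tokens)
full_vocab = list(allowed_vocab) + ["London", "Berlin", "Madrid"]  # distractor words

def toy_scores(prefix):
    """Stand-in for a seq2seq model's next-token scores (random for illustration)."""
    return {tok: random.random() for tok in full_vocab}

def constrained_greedy_decode(max_steps=10):
    output = []
    for _ in range(max_steps):
        scores = toy_scores(output)
        # Mask every token that is neither an input word nor a tag marker.
        best = max((tok for tok in scores if tok in allowed_vocab),
                   key=lambda tok: scores[tok])
        if best == "<eos>":
            break
        output.append(best)
    return output

print(constrained_greedy_decode())  # e.g. ['<loc>', 'Paris', '</loc>'] -- output varies
```

In practice, generation libraries such as Hugging Face transformers expose hooks like prefix_allowed_tokens_fn in generate() that support this kind of per-step masking.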
Language Consistency:
The style shifts slightly between formal academic language ("we focus on", "this thesis presents") and somewhat promotional language ("significantly outperforms ... by a wide margin"). Adjusting toward a consistently objective academic tone would strengthen its professionalism.
Scope of Evaluation:
The real-world impact is demonstrated in the medical domain, which is important, but it would add value to indicate whether and how the methods generalize beyond this domain.
Overall Impression:
This is a strong, well-structured, and contextually grounded summary appropriate for a thesis on cross-lingual NLP research. It clearly identifies a significant challenge, presents novel methodological contributions, and emphasizes both academic and practical importance. Adding some quantitative evidence of improvement and slightly clarifying the technical contributions will further enhance its clarity and impact.
If you intend this for a thesis abstract or proposal, consider adding a concise concluding sentence that summarizes the overall contributions or potential future directions to round off the presentation.
