Building the Romanian NLP API that should already exist
Romanian has no clean API for processing text programmatically. The author is building LexicRo, an open-core, hosted API platform, to fill that gap.
Why it matters
Providing robust NLP infrastructure for the Romanian language is crucial for developers working with Romanian text in production environments.
Key Points
- There is no robust API for Romanian NLP tasks like lemmatization, part-of-speech tagging, and grammatical feature extraction
- Existing academic resources for Romanian NLP are not packaged in a way that developers can easily use
- LexicRo aims to provide endpoints for morphological analysis, verb conjugation, word inflection, and lexical lookup
- The project is built on top of pre-existing models and datasets, with a focus on accuracy, speed, and predictable costs
Details
The author notes that Romanian NLP tooling lags significantly behind what is available for languages like English, French, and German. Academic resources exist, such as the DEXonline dictionary and the RoLEX morphosyntactic dataset, but they are not packaged in a form developers can readily use.
LexicRo, an open-core, hosted API platform, is the author's attempt to close that gap. It will provide endpoints for morphological analysis, verb conjugation, word inflection, and lexical lookup, all powered by fine-tuned BERT models and curated datasets. The goal is deterministic, structured linguistic data at a predictable cost, in contrast to large language models, which can be less reliable and more expensive at scale.
The author is seeking feedback on the endpoint design, along with early users, academic connections, and insights from those who have built adjacent solutions.
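To make "deterministic, structured linguistic data" concrete, here is a minimal Python sketch of how a client might turn a morphological-analysis response into typed records. The summary does not describe LexicRo's actual schema, so everything here is an assumption for illustration: the JSON layout, the field names (form, lemma, pos, features), the Universal Dependencies-style feature labels, and the hand-written sample body.

```python
import json
from dataclasses import dataclass, field


@dataclass
class TokenAnalysis:
    """One analysed token. Field names are hypothetical, not LexicRo's schema."""
    form: str    # surface form as it appears in the text
    lemma: str   # dictionary form
    pos: str     # part-of-speech tag
    features: dict = field(default_factory=dict)  # grammatical features


def parse_analysis(raw: str) -> list[TokenAnalysis]:
    """Parse a JSON response body into typed records.

    Raises KeyError if a required key is missing, so schema drift
    surfaces immediately instead of producing half-filled records.
    """
    payload = json.loads(raw)
    return [
        TokenAnalysis(
            form=t["form"],
            lemma=t["lemma"],
            pos=t["pos"],
            features=t.get("features", {}),
        )
        for t in payload["tokens"]
    ]


# Hand-written sample standing in for a real API response (assumed layout).
SAMPLE = """
{"tokens": [
  {"form": "copiii", "lemma": "copil", "pos": "NOUN",
   "features": {"Number": "Plur", "Definite": "Def", "Gender": "Masc"}},
  {"form": "merg", "lemma": "merge", "pos": "VERB",
   "features": {"Number": "Plur", "Person": "3", "Tense": "Pres"}}
]}
"""

tokens = parse_analysis(SAMPLE)
for tok in tokens:
    print(tok.form, tok.lemma, tok.pos, tok.features)
```

A fixed schema like this is what lets downstream code rely on exact keys and values, which is harder to guarantee when the same information is extracted from free-form LLM output.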