Building an Open Bilingual Q&A Dataset for Swedish Construction Law
The author built an open, bilingual, legally-grounded Q&A dataset for the Swedish construction industry, covering topics like permits, taxes, trades, legal issues, and regulations. The dataset contains 503 question-answer pairs in both Swedish and English, with each answer citing relevant Swedish statutes or authority guidance.
Why it matters
This dataset provides a valuable resource for training AI models on Swedish construction law and regulations, which can help automate legal research and question-answering in this domain.
Key Points
- 1Developed a 503-entry Q&A dataset covering Swedish construction law and regulations
- 2Dataset is bilingual (Swedish and English) and released under CC BY 4.0 license
- 3Answers are 30-150 words long and cite specific Swedish legal sources
- 4Designed dataset for use in training language models on Swedish legal/construction domain
- 5Released in multiple formats (JSON, JSONL, CSV) to ease integration
Details
The author identified a lack of open, legally-grounded training data for Swedish construction-related topics, which are often fragmented across various government websites and PDFs. To address this, they built a dataset of 503 question-answer pairs covering 39 categories, including permits, taxes, trades, legal issues, regulations, and cost/dispute resolution. Each answer is 30-150 words long and cites the relevant Swedish statute or authority guidance, such as PBL, BBR, or Skatteverket. The dataset is bilingual, with both Swedish and English versions provided, and is released under a CC BY 4.0 license. The author shares design choices, such as preserving Swedish legal terminology in the English set and embedding citations directly in the answer text. The dataset is available via Hugging Face and as a pip-installable Python package, with support for filtering by category and iterating over the data for language model fine-tuning.
No comments yet
Be the first to comment