Corpws o frawddegau o destun Cymraeg wedi'u trwyddedu o dan drwydded CC0 | A corpus of Welsh texts licensed under the CC0 licence

techiaith/corpws-CC0

English below

Corpws CC0

Dyma gorpws o frawddegau o destun Cymraeg wedi'u trwyddedu o dan drwydded CC0. Ar hyn o bryd, mae'r corpws yn cynnwys bron i 20,000 o frawddegau dros 180,000 o docynnau, a'r bwriad yw parhau i'w gynyddu wrth i ni gael gafael ar destunau o dan y drwydded briodol. Bwriad y corpws hwn yw galluogi hyfforddi modelau iaith Cymraeg ar gyfer sawl diben gwahanol.

Casglwyd y testunau o wahanol ffynonellau gan gynnwys testunau allan o hawlfraint a thestunau a rannwyd â ni o dan drwydded CC0 gan awduron gwreiddiol, er enghraifft erthyglau Wicipedia a negeseuon Twitter a ysgrifennwyd gan yr unigolion hynny. Mae'r testunau hefyd yn cynnwys brawddegau a awdurwyd gan staff y project er mwyn darparu enghreifftiau o nodweddion ieithyddol penodol i'r corpws.

Casglwyd llawer o'r testunau hyn er mwyn eu cyfrannu i Common Voice, project gan gwmni Mozilla sy'n casglu data agored er mwyn creu lleisiau synthetig ar gyfer ieithoedd y byd. Mae'r ffeil hon felly yn cynnwys nifer o'r un brawddegau a geir yn https://github.com/techiaith/brawddegau-adnabod-lleferydd, ond yn ychwanegol at hynny ceir brawddegau eraill oedd yn rhy hir ar gyfer anghenion Common Voice, neu'n cynnwys nodau neu gynnwys arall a oedd yn anaddas ar gyfer y promptiau recordio.

Ychwanegiad Hydref 2021

Rydym hefyd wedi ychwanegu at gynnwys y corpws hwn drwy ddethol is-set o dros 100k o frawddegau Cymraeg o gorpws CoVost Facebook o gyfieithiadau peirianyddol o frawddegau Saesneg Common Voice. Lluniwyd yr is-set hon (a fwriadwyd yn wreiddiol ar gyfer gweithredu fel promptiau recordio) drwy hidlo allan y brawddegau hynny oedd yn hwy na 15 gair, neu'n cynnwys digidau, acronymau neu dalfyriadau, neu a oedd yn cynnwys geiriau nad oeddynt yn Lecsicon Cymraeg Bangor (ac eithrio rhai geirffurfiau penodol). Gweler https://github.com/techiaith/brawddegau-adnabod-lleferydd/blob/master/data/covost/README.md am ragor o fanylion. Gan nad brawddegau a awdurwyd yn y Gymraeg yn wreiddiol yw'r rhain, rydym wedi eu cadw ar wahân mewn ail ffeil, sef cy_covost_subset.txt, fel y gallwch benderfynu eu defnyddio ai peidio yn ddibynnol ar eich angen penodol chi. Er mai brawddegau a gyfieithwyd yn beirianyddol yw'r rhain, adolygwyd sampl ohonynt gan olygyddion dynol a chael bod llai na 5% ohonynt yn broblemus (ffigwr sy'n cymharu'n dda â realiti y testunau Cymraeg gwreiddiol a gawn ar y we). Yn ogystal, teimlwn fod y brawddegau hyn yn ddefnyddiol gan eu bod yn cynnwys detholiad o bynciau ac amserau a phersonau gramadegol sy'n anodd i'w cael fel arall o fewn casgliad o destunau sydd â thrwydded rydd fel CC0 arni. Er na chredwn y byddai testunau cy_covost_subset.txt yn addas ar gyfer dadansoddiadau diwylliannol a ieithyddol gymdeithasol o'r Gymraeg, credwn eu bod yn werthfawr ar gyfer hyfforddi modelau iaith uniaith Cymraeg lle nad oes digon o destunau gwreiddiol Cymraeg ar gael fel arall.

Ychwanegiad Mawrth 2023

Rydym hefyd wedi ychwanegu at gynnwys y corpws hwn drwy normaleiddio detholiad o’r lleferydd a drawsgrifiwyd yn ‘verbatim’ gennym er mwyn ei gyhoeddi o fewn ein banc trawsgrifiadau. At ei gilydd, rydym wedi normaleiddio dros 4000 o’r trawsgrifiadau hynny a'u hychwanegu at y corpws hwn fel ffeil ar wahân. Gweler: https://git.techiaith.bangor.ac.uk/data-porth-technolegau-iaith/banc-trawsgrifiadau-bangor am fwy o fanylion ynghylch ffurf wreiddiol y trawsgrifiadau a’r egwyddorion trawsgrifio y defnyddiwyd, neu i lwytho’r banc cyfan i lawr.

Cyfrannu

Gallwch ein helpu i gynyddu maint y corpws hwn drwy gyfrannu unrhyw destunau o'ch eiddo chi i ni o dan drwydded CC0 fel eu bod ar gael yn rhydd i bawb. Os am wneud hynny, cysylltwch â [email protected].

CC0 Corpus

This is a corpus of Welsh texts licensed under the CC0 licence. The corpus currently contains nearly 20,000 sentences and over 180,000 tokens, and we aim to keep increasing its size as we secure more texts under the appropriate licence. The corpus is intended to enable the training of Welsh language models for a variety of purposes.
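The sentence and token counts above can be checked with a short script. The following is a minimal sketch that assumes each corpus file holds one sentence per line and that a token is any whitespace-separated string; neither assumption is confirmed by this README, and the function name `corpus_stats` is purely illustrative.

```python
def corpus_stats(lines):
    """Count non-empty lines (assumed to be sentences) and whitespace tokens."""
    sentences = 0
    tokens = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        sentences += 1
        tokens += len(line.split())  # assumption: whitespace tokenisation
    return sentences, tokens


if __name__ == "__main__":
    sample = ["Mae'r gath yn cysgu.", "", "Bore da."]
    print(corpus_stats(sample))
```

To run it over a real file, pass an open file object instead of a list, for example `corpus_stats(open("cy_covost_subset.txt", encoding="utf-8"))`. A different tokeniser will give different totals from those quoted above.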

The texts were collected from various sources, including out-of-copyright texts and texts that were shared with us under the CC0 licence by their original authors, for example Wikipedia articles and Twitter messages written by those individuals. The texts also include sentences authored by project staff to provide the corpus with examples of specific linguistic features.

Many of these texts were collected for input into Common Voice, a project by Mozilla that collects open data to create synthetic voices for world languages. This file therefore contains many of the same sentences found at https://github.com/techiaith/brawddegau-adnabod-lleferydd. In addition, it contains many sentences that were too long for the needs of Common Voice, or that contained characters or other content unsuitable for the recording prompts.

October 2021 Addition

We have added to the content of this corpus by selecting a subset of over 100k Welsh sentences from the CoVost Facebook corpus of machine-translated English Common Voice sentences. This subset (originally intended to serve as recording prompts) was created by filtering out sentences that exceeded 15 words, contained digits, acronyms or abbreviations, or contained words not found in the Bangor Welsh Lexicon (with the exception of some specific word forms). See https://github.com/techiaith/brawddegau-adnabod-lleferydd/blob/master/data/covost/README.md for more details. As these sentences were not originally written in Welsh, we have kept them separate in a second file, cy_covost_subset.txt, so you may decide whether or not to use them depending on your specific aims. Although these are machine-translated sentences, a sample of them was reviewed by human editors, who found that fewer than 5% of the sentences were problematic (a figure that compares well with the original Welsh texts found on the web). We also find these sentences useful because they contain a selection of topics, grammatical tenses and persons that are otherwise difficult to find within freely licensed texts. As a result, whilst we do not recommend using the cy_covost_subset.txt texts for cultural and sociolinguistic analysis of the Welsh language, we believe that they are valuable for training monolingual Welsh language models where there would otherwise be insufficient original Welsh texts available.
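The filtering criteria described above can be illustrated roughly as follows. This is a hedged sketch, not the project's actual filter (see the linked README for that). The function names, the punctuation set, the all-caps acronym heuristic and the internal-full-stop abbreviation heuristic are all assumptions made for illustration, and `lexicon` stands in for a set of lower-cased word forms from the Bangor Welsh Lexicon.

```python
PUNCTUATION = ".,;:!?\"()"  # assumption: punctuation stripped from token edges


def tokenize(sentence):
    """Split on whitespace and strip edge punctuation (a deliberately simple scheme)."""
    tokens = (tok.strip(PUNCTUATION) for tok in sentence.split())
    return [tok for tok in tokens if tok]


def keep_sentence(sentence, lexicon, max_words=15):
    """Return True if the sentence passes all four illustrative filters."""
    # 1. Length: reject sentences longer than max_words.
    if len(sentence.split()) > max_words:
        return False
    # 2. Digits: reject any sentence containing a digit.
    if any(ch.isdigit() for ch in sentence):
        return False
    tokens = tokenize(sentence)
    # 3a. Acronyms: heuristic, any all-uppercase token of two or more characters.
    if any(len(tok) >= 2 and tok.isupper() for tok in tokens):
        return False
    # 3b. Abbreviations: heuristic, a full stop anywhere before the sentence end.
    if "." in sentence.rstrip().rstrip(".!?"):
        return False
    # 4. Lexicon: every token must be a known word form.
    return all(tok.lower() in lexicon for tok in tokens)


if __name__ == "__main__":
    toy_lexicon = {"mae'r", "gath", "yn", "cysgu"}  # hypothetical stand-in lexicon
    for s in ["Mae'r gath yn cysgu.", "Mae'r BBC yn cysgu.", "Mae'r gath yn cysgu 3 awr."]:
        print(s, keep_sentence(s, toy_lexicon))
```

The lexicon check is the strictest of the four filters: it trades coverage for confidence that each retained sentence uses only attested Welsh word forms, which suits recording prompts but also explains why the subset is much smaller than the full CoVost corpus.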

March 2023 Addition

We have also added to the content of this corpus by normalizing a selection of the speech we transcribed in a 'verbatim' style for publication in our transcript bank. In total, we have normalized over 4000 of those transcriptions, which have been added to this corpus as a separate file. See https://git.techiaith.bangor.ac.uk/data-porth-technolegau-iaith/banc-trawsgrifiadau-bangor for more information about the original format of the transcriptions and the transcription conventions used, or to download the transcription bank in its entirety.

Contributing

You can help us increase the size of this corpus by donating any texts that you own to us under the CC0 licence so that they are freely available to everyone. To do so, please contact [email protected].
