
Re: [ossig] BM word list - release under what license?



> so you are proposing something like GFDL, but with no restrictions on
> the use of the list itself, only that modifications remain open, right?

I had to grok the GFDL before I could answer you. GFDL pretty much
summarized my opinion and 

> take several hundred reasonable quality BM online news articles, documents
> etc into open office, as a single document.  that way there will be a low
> level of spelling errors in the source documents, but there will be some.
> in case of query, we will record which urls we got these documents from.
> 
> add a new custom dictionary in open office.
> 
> add every BM word that is flagged as a spelling mistake to the custom
> dictionary, manually referring to a paper dictionary in case of query.

Sounds like a fairly efficient way of doing it. 

I had a different idea: build a frequency count of the words in several
dozen online BM articles/news stories, then start from the bottom of the
count (the rarest words) and remove the ones that are not BM. This would
give a very rough ranking of the more popular BM words, and since words
that are not BM should mostly sit among the rare entries, it would be
easier to weed them out. Before counting, I'd first drop words that are
all in CAPS, numbers, words with numbers in them, email addresses, links
etc so those won't even appear in the frequency count.
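
The frequency-count idea above could be sketched roughly as follows in
Python. This is only an illustration under assumptions, not something
that has been written or tested against real BM text: the names
candidate_words, frequency_count and rarest_first are hypothetical, the
articles are assumed to already be plain-text strings, and the filters
are a guess at what "CAPS, numbers, email addresses, links" would mean
in practice.

```python
import re
from collections import Counter

# Assumed filters, following the exclusions listed above.
URL_OR_EMAIL = re.compile(r"(https?://|www\.|@)", re.IGNORECASE)
# Letters only, with optional hyphens so reduplicated forms
# such as "kanak-kanak" survive as a single word.
WORD = re.compile(r"[A-Za-z]+(?:-[A-Za-z]+)*")

def candidate_words(text):
    """Yield lowercased tokens that survive the exclusions."""
    for token in text.split():
        if URL_OR_EMAIL.search(token):
            continue  # email address or link
        token = token.strip(".,;:!?\"'()[]")
        if not token or token.isupper():
            continue  # empty after stripping, or all in CAPS
        if any(ch.isdigit() for ch in token):
            continue  # a number, or a word with numbers in it
        if WORD.fullmatch(token):
            yield token.lower()

def frequency_count(articles):
    """Count surviving words across an iterable of article texts."""
    counts = Counter()
    for text in articles:
        counts.update(candidate_words(text))
    return counts

def rarest_first(counts):
    """Return (word, count) pairs, rarest first, for manual review."""
    return sorted(counts.items(), key=lambda item: (item[1], item[0]))
```

Reviewing the output of rarest_first from the top would correspond to
"starting from the bottom of the frequency count"; deciding whether a
given word is BM would still be a manual step.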

Of course, I have no idea whether this would be any better than what
Chris had suggested. I guess I'll need to write it and check it out.

Cheers.
-- 
Ditesh Kumar
Ameba6 Solutions Sdn. Bhd.

"Anyone who considers arithmetical methods of producing random digits
is, of course, in a state of sin." 	- John Von Neumann

