Friday, December 23, 2016

Unicode in LispWorks

Background

I have a really basic vocabulary quiz program that I have been dabbling with, partly as a means of learning to use LispWorks and partly as a way to get a little different way of reviewing my vocabulary in my personal study of NT Greek. I obtained the basic vocabulary from Mounce's Flashworks program. There are some text files for vocabulary for a few different languages, including Greek and Hebrew. These files can be read as UTF-8 Unicode.

Unicode support in Common Lisp is non-uniform, owing to the fact that the language standard was established before Unicode arrived [Seibel]. So, what works in one implementation may not work in another. I have found this in working with LispWorks and Closure Common Lisp. Since I wanted something easy, not something clever, I have not attempted to bridge these. I just want the "easy" answers for the basic problems of reading and writing Unicode in LispWorks.

Reading and Writing Unicode

If I want to read a basic tab delimited Unicode file, line by line, I use get-text-data (below). keepers allows me to take a subset of the columns in the file



The first awful thing you will notice here is the special-read function. This is not always necessary, but I did have a case where I had a leading Byte-Order-Mark (BOM: 0xFEFF) that I needed remove from the file. This, somewhat oddly, but understandably, is called #\Zero-Width-No-Break-Space in LispWorks [WikiPedia]. If memory serves, putting the BOM in (which I did on purpose at one point) made reading the file work without specifying a format to read in. But the BOM is optional for utf-8.

Writing out to a file is quite straightforward, but the question is whether it will be possible to read it in correctly. The answer is that it depends on how you will read it in later. If you are only going to read it into the same program or into programs that will assume UTF-8, then there's no issue. But if you want to be able to read it in without thinking about whether the source file is Unicode or not, you can deliberately output the BOM so that it will read successfully in LispWorks even if you did not specify the :external-format. Below is a sample of doing just this.



The function colleciton->plist is my own function which turns my object into a plist which can be read by the built-in function read. (This is my way of textifying CLOS objects to write them to file—the details are not relevant to my topic.)

Now, I can read in the plist that I wrote out, without specifying the :external-format, as in:



However, if I didn't manually output the BOM, I would need to write:



The long and the short of it is that you can use :external-format all the time and use a consistent format for files that pertain to your program and that's probably the best practice. If you want to be sure another LispWorks program is going to be OK with your UTF-8 file, whether it is consistently using that format or not, putting a BOM at the start for good measure may be a good idea. On the other hand, maybe the reader will choke on the BOM because it doesn't filter it out when reading a file it already knows is utf-8.

So, I didn't solve the problem, but if you're head butting a format issue in Unicode, maybe this says enough about it that you can find something to do with the particular data you can't seem to read right now—because, I feel your pain.

For more information on IO for Unicode in LispWorks, see the manual.

No comments: