ΑΡΙΘΜΟΣ: Unicode in LispWorks

Background

I have a very basic vocabulary quiz program that I have been dabbling with, partly as a means of learning to use LispWorks and partly as a slightly different way of reviewing vocabulary in my personal study of NT Greek. I obtained the basic vocabulary from Mounce's Flashworks program, which includes text files of vocabulary for a few different languages, including Greek and Hebrew. These files can be read as UTF-8 Unicode.

Unicode support in Common Lisp is non-uniform, owing to the fact that the language standard was established before Unicode arrived [Seibel]. So, what works in one implementation may not work in another. I have found this in working with LispWorks and Clozure Common Lisp. Since I wanted something easy, not something clever, I have not attempted to bridge these. I just want the "easy" answers for the basic problems of reading and writing Unicode in LispWorks.

Reading and Writing Unicode

If I want to read a basic tab-delimited Unicode file line by line, I use get-text-data (below). The optional keepers argument lets me take a subset of the columns in the file.

(defun get-text-data (filename &optional keepers)
  (with-open-file (stream filename :external-format :utf-8)
    (loop for ln = (special-read stream) then (read-line stream nil)
          while ln
          collect (filter-columns (cl-utilities:split-sequence #\TAB ln) keepers))))

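;; Read one line, dropping a leading byte-order mark (U+FEFF) if present.
(defun special-read (stream)
  (let ((ln (read-line stream nil)))
    (when (and ln
               (> (length ln) 0)
               ;; char= rather than eq: eq is not reliable on characters
               (char= (aref ln 0) #\Zero-Width-No-Break-Space))
      (setf ln (subseq ln 1)))
    ln))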

;; list can be any non-empty list
;; keepers should be a vector of zero-based column indexes
;; every index in keepers must be less than the number of items in the list
;; the values are returned in the order the indexes appear in keepers,
;; i.e., the order of values can be changed this way
(defun filter-columns (list keepers)
  (if keepers
      (loop with arr = (coerce list 'vector)
            for i across keepers
            collect (aref arr i))
    list))

The first awful thing you will notice here is the special-read function. It is not always necessary, but I did have a case where a file had a leading byte-order mark (BOM, U+FEFF) that I needed to remove. Somewhat oddly, but understandably, LispWorks calls this character #\Zero-Width-No-Break-Space [WikiPedia]. If memory serves, putting the BOM in (which I did on purpose at one point) made reading the file work without specifying an external format. But the BOM is optional for UTF-8.
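The same check can be spelled without the LispWorks-specific character name. This is only a sketch: strip-bom and *bom* are hypothetical names that are not part of my program, and it assumes an implementation whose characters cover U+FEFF.

```lisp
;; Hypothetical helper, not part of the original program.
;; Assumes (code-char #xFEFF) returns a character, which holds in
;; Unicode-aware implementations.
(defparameter *bom* (code-char #xFEFF)
  "The byte-order mark as a Lisp character.")

(defun strip-bom (line)
  "Return LINE without a leading byte-order mark. NIL and empty
strings pass through unchanged."
  (if (and line
           (plusp (length line))
           (char= (char line 0) *bom*))
      (subseq line 1)
      line))
```

Calling (strip-bom (read-line stream nil)) on the first line of a file would then stand in for special-read.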

Writing out to a file is quite straightforward, but the question is whether it will be possible to read it in correctly. The answer is that it depends on how you will read it in later. If you are only going to read it into the same program or into programs that will assume UTF-8, then there's no issue. But if you want to be able to read it in without thinking about whether the source file is Unicode or not, you can deliberately output the BOM so that it will read successfully in LispWorks even if you did not specify the :external-format. Below is a sample of doing just this.

(defmethod write-to-file ((cc vocabulary-card-collection) filename)
  (with-open-file (stream filename 
                          :direction :output 
                          :if-exists :supersede 
                          :external-format '(:utf-8 :eol-style :crlf))
    (write-char #\Zero-Width-No-Break-Space stream)
    (print (collection->plist cc) stream)))

The function collection->plist is my own function, which turns my object into a plist that can be read by the built-in function read. (This is my way of textifying CLOS objects to write them to file; the details are not relevant to my topic.)
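To illustrate only the "readable by read" part, here is a minimal sketch that prints a plist to a string and reads it back. The data and the name round-trip are made up for the example; the real collection->plist is not shown in this post and may produce a different shape.

```lisp
;; Hypothetical sample data, standing in for what collection->plist returns.
(defparameter *sample-plist*
  '(:cards ((:greek "λόγος" :gloss "word")
            (:greek "ἀριθμός" :gloss "number"))))

(defun round-trip (object)
  "Print OBJECT readably to a string, then read it back in."
  (with-standard-io-syntax
    (let ((text (prin1-to-string object))
          ;; Refuse #. forms when reading, in case the text is not trusted.
          (*read-eval* nil))
      (read-from-string text))))
```

Comparing (round-trip *sample-plist*) with *sample-plist* using equal is a quick check that the Greek strings survive printing and reading.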

Now, I can read in the plist that I wrote out, without specifying the :external-format, as in:

(with-open-file (stream "d:/documents/blah2.txt")
               (read stream))

However, if I didn't manually output the BOM, I would need to write:

(with-open-file (stream "d:/documents/blah2.txt" :external-format :utf-8)
               (read stream))

The long and the short of it is that you can use :external-format all the time and use a consistent format for the files that pertain to your program, and that is probably the best practice. If you want to be sure another LispWorks program is going to be OK with your UTF-8 file, whether it is consistently using that format or not, putting a BOM at the start for good measure may be a good idea. On the other hand, a reader that already knows the file is UTF-8 may choke on the BOM if it doesn't filter it out.
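One way around that last worry is a reader that always asks for UTF-8 and discards a BOM only if one is there. This is a sketch rather than something from my program: read-form-skipping-bom is a hypothetical name, and it assumes the implementation accepts :utf-8 as an external format.

```lisp
;; Hypothetical helper, not part of the original program.
(defun read-form-skipping-bom (filename)
  "Read one Lisp form from FILENAME as UTF-8, ignoring a leading BOM.
Returns NIL if the file holds no form."
  (with-open-file (stream filename :external-format :utf-8)
    ;; Look at the first character without consuming it.
    (let ((first-char (peek-char nil stream nil nil)))
      (when (and first-char
                 (char= first-char (code-char #xFEFF)))
        ;; Consume the BOM so READ never sees it.
        (read-char stream)))
    (read stream nil nil)))
```

With this, the same call works whether or not the writer put a BOM at the start of the file.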

So, I didn't solve the problem, but if you're butting heads with a format issue in Unicode, maybe this says enough about it that you can find something to do with the particular data you can't seem to read right now. I feel your pain.

For more information on IO for Unicode in LispWorks, see the manual.

Friday, December 23, 2016
