add charwidth property #2
Comments
It looks like Unicode does not actually maintain a definitive list of character widths. The only data I can find on this topic is on East Asian characters; my understanding of how this ought to translate into numerical character widths is summarized in this gist. Confusingly, there are wide(W)-narrow(Na) and full(F)-half(H) pairings which appear to have the same functionality; I'm not clear on the differences between the two pairs. |
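For readers without access to the gist, here is a minimal sketch of one common way to turn East Asian Width classes into column counts. It is not the gist's content. The names `EAW_COLUMNS`, `eaw_columns`, and `parse_eaw_line` are hypothetical, treating ambiguous (A) characters as one column is an assumption, and the code uses current Julia syntax rather than the 0.3-era syntax seen elsewhere in this thread.

```julia
# Assumed mapping from East Asian Width class (UAX #11) to terminal columns.
# "A" (ambiguous) is treated as narrow here; CJK terminals often use 2 instead.
const EAW_COLUMNS = Dict(
    "F"  => 2,  # fullwidth
    "W"  => 2,  # wide
    "H"  => 1,  # halfwidth
    "Na" => 1,  # narrow
    "N"  => 1,  # neutral
    "A"  => 1,  # ambiguous (assumption)
)

eaw_columns(class::AbstractString) = EAW_COLUMNS[class]

# Parse one data line in the style of EastAsianWidth.txt, e.g.
# "1100..115F;W  # comment". Returns (code point range, class), or
# nothing for blank and comment-only lines.
function parse_eaw_line(line::AbstractString)
    data = strip(first(split(line, '#')))
    isempty(data) && return nothing
    range_str, class = strip.(split(data, ';'))
    bounds = parse.(UInt32, split(range_str, ".."); base=16)
    return (first(bounds):last(bounds), String(class))
end
```

As far as I understand UAX #11, F and H mark compatibility variants of characters that also exist in the other width, while W and Na describe characters that are wide or narrow in their own right. That would explain why both pairs collapse to the same two column counts.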
See also this wcwidth implementation, which I think we use in Julia on Windows. Unfortunately, it dates to Unicode 5.0. |
And a slightly patched version of wcwidth in newlib, also quite old. |
Here is a BSD-licensed Go version; I don't know whether it is completely up to date with the latest Unicode standard. |
Never mind, I see it is just a port of Markus Kuhn's version. |
@jakebolewski and I were just discussing how we could test this. The challenge here is finding a usable font; there aren't that many monospaced fonts with decent Unicode support. I personally use Consolas and find its support for the BMP to be pretty good. But GNU Unifont 7.0 seems like the one for this task, should anyone be foolhardy enough to try. |
Or it may be sufficient to simply compute bounding boxes from Unifont's font table bitmap. |
Or process the Unicode archived code charts directly. (93 MB PDF) |
Note that GNU libunistring also provides character width functions (LGPL). I'm not sure how they are implemented or how up-to-date they are, but at the very least they provide a useful baseline to compare to. |
Here is one chunk of code to address a prerequisite part of the problem, which is how the output of This code snippet classifies code points in the BMP (from
general_category_abbr=[
"Lu", "Ll", "Lt", "Lm", "Lo", "Mn", "Mc", "Me", "Nd", "Nl",
"No", "Pc", "Pd", "Ps", "Pe", "Pi", "Pf", "Po", "Sm", "Sc",
"Sk", "So", "Zs", "Zl", "Zp", "Cc", "Cf", "Cs", "Co", "Cn"
]
categories=Dict()
for c in 0:65535
catcode = Base.UTF8proc.category_code(int32(c))
#catcode=unsafe_load(ccall((:utf8proc_get_property,:libmojibake), Ptr{Uint16}, (Int32,), c))
isprintable = isprint(char(c))
categories[(catcode, isprintable)]=push!(get(categories,
(catcode, isprintable), {}), c)
end
abbr(catcode)= catcode==0 ? "00" : general_category_abbr[catcode]
for catcode in 0:30, val in [false, true]
println(abbr(catcode), "\t", val, "\t", length(get(categories, (catcode, val), [])))
end
Note that I had to accommodate a possible return value of 0. The results for the BMP, when summarized by character category, are:
I naively thought that everything was printable except for the category C characters, but this turns out not to be true...? |
@JeffBezanson and I briefly discussed what it meant to be a printable character but I don't think we really came to a firm answer. Instead, I'm going to break it down by type of code point (Unicode 6.3 standard pdf; Ch 2, Table 2-3), which makes the decision much clearer for me:
I think there is no question that all Graphic code points should be printable and all Control/Surrogate/Noncharacter/Reserved code points should not be printable, from their very definitions. Format code points are debatable, and Private-use code points I don't think are decidable. |
Based on this, I'll propose that a printable character is any code point that is not in the Cc, Cn or Cs general categories. These are the code points for which wcwidth should return -1 and have charwidth 0. I don't think it's necessary to refine the definition of 'printable character' beyond the Unicode General Category. |
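A minimal sketch of this proposal in current Julia syntax follows. The function names are hypothetical, and `Base.Unicode.category_abbrev` is an unexported internal helper, so relying on it is an assumption that may not hold across Julia versions. `textwidth` stands in for the `charwidth` function discussed in this thread.

```julia
# General categories that the proposal treats as non-printable.
const NONPRINTABLE_CATEGORIES = ("Cc", "Cn", "Cs")

# Printable means: any code point whose general category is not Cc, Cn or Cs.
isprintable_proposed(c::AbstractChar) =
    !(Base.Unicode.category_abbrev(c) in NONPRINTABLE_CATEGORIES)

# wcwidth-style result: -1 for non-printable code points.
proposed_wcwidth(c::AbstractChar) =
    isprintable_proposed(c) ? textwidth(c) : -1

# charwidth-style result: 0 for non-printable code points.
proposed_charwidth(c::AbstractChar) =
    isprintable_proposed(c) ? textwidth(c) : 0
```

For example, `proposed_wcwidth('\n')` returns -1 because a newline is in category Cc, while `proposed_charwidth('\n')` returns 0 for the same character.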
Here is another snippet of code that tries to compute character widths (cc @timholy: this is what I've been abusing Images for in the past few days).
using Images
#Download BMP of BMP
filename="unifont-7.0.03.bmp"
isfile(filename) || download("http://unifoundry.com/pub/unifont-7.0.03/unifont-7.0.03.bmp", filename)
unitable=imread("unifont-7.0.03.bmp", Images.ImageMagick)
#Check for printable character
catcode(c::Union(Char,Integer))=Base.UTF8proc.category_code(int32(c))
isprintable(c::Union(Char,Integer)) = c ≤ 0x10ffff && isprintable_category(catcode(c))
function isprintable_category(category::Integer)
!( category==Base.UTF8proc.UTF8PROC_CATEGORY_CN #Unassigned
|| category==Base.UTF8proc.UTF8PROC_CATEGORY_CS #Surrogate
|| category==Base.UTF8proc.UTF8PROC_CATEGORY_CC #Control
)
end
#Compute left and right bounds
function wcwidthbmp(codepoint::Integer)
isprintable(codepoint) || return -1
col=(codepoint & 0xff)
row=(codepoint >> 8)
charbmp=unitable.data[32+col*16+(1:16),64+row*16+(1:16)]'
l=1
for j=1:16
any(charbmp[:,j] .== 0) && break
l += 1
end
r=16
for j=16:-1:l
any(charbmp[:,j] .== 0) && break
r -= 1
end
(r-l)//8
end
#Box it up
Boxes=Dict()
for c in 0x0000:0xffff
wcwidth_bmp = iceil(wcwidthbmp(c))
wcwidth_sys = int(ccall(:wcwidth, Int32, (Uint32,), c))
coord = (wcwidth_bmp, wcwidth_sys)
Boxes[coord]=push!(get(Boxes, coord, {}), c)
end
#Draw table in Github-flavored Markdown
k1min, k1max=extrema([k[1] for (k,v) in Boxes])
k2min, k2max=extrema([k[2] for (k,v) in Boxes])
println("wcwidth (system) -->")
print("\t |\t")
for k2=k2min:k2max
print(k2," |\t")
end
println()
println(" ------- |"^(k2max-k2min+1)*" --------")
for k1=k1min:k1max
print("__",k1, "__\t |\t")
for k2=k2min:k2max
k1==k2 && print("__")
print(length(get(Boxes, (k1,k2), [])))
k1==k2 && print("__")
print(" | \t")
end
println()
end
Results (OSX 10.9.4, libutf8proc), with the system wcwidth on columns and the derived wcwidth on rows:
|
That's pretty cool. |
Oh, right. And spaces will have width 0 by this computation. Take two - parse the |
Sadly I think I'll have to find a different font to read, since computing the charwidths from the font file directly is probably a GPL-derivative. |
Another item to throw into the mix - the Unicode private use region (category Co) appears to have a consensus of usage; there is a ConScript Unicode Registry |
I find this highly unlikely. Data about copyrighted things is not copyrighted. |
Ok, that makes it a lot easier. So I opened Unifont in FontForge and saved it in FontForge's SFD format, which is plaintext and quite easily parsed.
#Read sfdfile for character widths
function parsesfd(filename::String)
CharWidths=Dict{Int,Int}()
state=:seekchar
for line in readlines(open(filename))
if state==:seekchar #StartChar: nonmarkingreturn
if contains(line, "StartChar: ")
codepoint = nothing
width = nothing
state = :readdata
end
elseif state==:readdata #Encoding: 65538 -1 2, Width: 1024
contains(line, "Encoding:") && (codepoint = int(split(line)[3]))
contains(line, "Width:") && (width = int(split(line)[2]))
if codepoint!=nothing && width!=nothing
CharWidths[codepoint]=width
state = :seekchar
end
end
end
CharWidths
end
@time CharWidths=parsesfd("UnifontMedium.sfd")
println("Number of character widths read: ", length(CharWidths))
#Classify characters
Boxes=Dict()
for c in 0x0000:0xffff
haskey(CharWidths,c) || continue
idx = (CharWidths[c]÷512, charwidth(char(c)))
Boxes[idx] = push!(get(Boxes, idx, {}), c)
end
#Output table in GFM format
print("i \\ j")
for j=0:2
print("\t | ", j )
end
println("\n"*"------- | "^3 * "-------")
for i=0:2
print("__", i, "__")
for j=0:2
print("\t | ")
i==j && print("__")
print(length(get(Boxes, (i,j), {})))
i==j && print("__")
end
println()
end
The output of this program on my local MBP is
where the rows are the computed charwidths from the font advance widths and the columns are the system output from charwidth. A few of the discrepancies are small enough to inspect:
general_category_abbr=[
"Lu", "Ll", "Lt", "Lm", "Lo", "Mn", "Mc", "Me", "Nd", "Nl",
"No", "Pc", "Pd", "Ps", "Pe", "Pi", "Pf", "Po", "Sm", "Sc",
"Sk", "So", "Zs", "Zl", "Zp", "Cc", "Cf", "Cs", "Co", "Cn"
];
abbr(catcode)= catcode==0 ? "00" : general_category_abbr[catcode]
catabbr(c)=abbr(Base.UTF8proc.category_code(int32(c)))
map(_->(_, char(_), catabbr(_)), Boxes[0, 2])
These are all Chinese punctuation characters that are combining marks.
map(_->(_, char(_), catabbr(_)), Boxes[1, 2])
|
@jiahao, as I understand it, Conscript is only about use of |
@jiahao, which system's |
This is on my MBP running OSX 10.9.4. I'll look into producing a script that can be run on other systems. |
So... I finally have a candidate. This IJulia notebook spits out an implementation of Final discrepancy table:
|
Probably you want a binary tree of |
A weighted binary tree based on the relative occurrence of the different codepoints? |
I have this code example of a macro/function pair that build a binary tree of |
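One way to picture the lookup being discussed is sketched below. It uses a binary search over a sorted array of ranges instead of an explicit tree, which visits the same comparisons as a balanced binary tree but is not the macro/function pair mentioned above. The table is a tiny hypothetical excerpt, not real generated data, and `WIDTH_RANGES` and `lookup_width` are made-up names in current Julia syntax.

```julia
# Hypothetical, heavily abbreviated table of (first, last, width) entries,
# sorted by code point and non-overlapping.
const WIDTH_RANGES = [
    (0x0300, 0x036f, 0),  # combining diacritical marks
    (0x1100, 0x115f, 2),  # Hangul Jamo initial consonants
    (0x3041, 0x3096, 2),  # Hiragana
    (0x4e00, 0x9fff, 2),  # CJK Unified Ideographs
    (0xff01, 0xff60, 2),  # fullwidth forms
]

# Binary search for the range containing cp; code points outside every
# range fall back to `default`.
function lookup_width(cp::Integer; default::Int = 1)
    lo, hi = 1, length(WIDTH_RANGES)
    while lo <= hi
        mid = (lo + hi) >>> 1
        first_cp, last_cp, width = WIDTH_RANGES[mid]
        if cp < first_cp
            hi = mid - 1
        elseif cp > last_cp
            lo = mid + 1
        else
            return width
        end
    end
    return default
end

lookup_width(c::AbstractChar; kwargs...) = lookup_width(UInt32(c); kwargs...)
```

A weighted tree, as suggested above, would trade this uniform depth for shorter paths to frequently used code points such as ASCII.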
I wonder whether it would be more conservative to assume |
ConScript only concerns itself with Co (private use); I don't think it touches Cn. |
Would be great to get this merged in some form (probably after converting to some sort of binary tree). |
An even simpler implementation would be to just add a 2-bit The main question is whether Jan is interested in incorporating this upstream. |
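To illustrate why two bits are enough, here is a minimal sketch of packing a width of 0 to 3 into a property word. Everything in it is hypothetical: the 16-bit word, the bit position, and the names `WIDTH_SHIFT`, `WIDTH_MASK`, `set_width`, and `get_width` are assumptions for illustration in current Julia syntax, not the layout used by utf8proc.

```julia
const WIDTH_SHIFT = 14                          # hypothetical bit position
const WIDTH_MASK  = UInt16(0b11) << WIDTH_SHIFT # two bits: values 0 through 3

# Store a width in the two reserved bits, leaving the other bits untouched.
function set_width(flags::UInt16, width::Integer)
    0 <= width <= 3 || throw(ArgumentError("width must fit in 2 bits"))
    return (flags & ~WIDTH_MASK) | (UInt16(width) << WIDTH_SHIFT)
end

# Read the width back out of the packed word.
get_width(flags::UInt16) = Int((flags & WIDTH_MASK) >> WIDTH_SHIFT)
```

With widths limited to 0, 1, and 2, one of the four encodable values stays free, for example to mark code points that have no defined width.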
Has anything been incorporated upstream so far? Just curious how realistic upstreaming changes really is. |
They've incorporated our Unicode-7 patches (though not in a release yet) and posted a public mercurial repository. I haven't heard back yet about the grapheme updates. |
Ah, cool. That's quite positive sounding. |
Any news on the upstream inclusion of mojibake patches? I'd just like to make sure it will be ready when Julia 0.4 is released. |
I haven't heard back; just pinged them again. |
Now that we are the official utf8proc maintainers, I think we should go ahead and add a 2-bit |
Yes! |
As discussed in JuliaLang/julia#6939, the wcwidth function is broken on many operating systems. When we import the Unicode data, it would be good to add another field to our database in order to store the character width, so that we can provide an up-to-date character-width function.