local info = {
    version   = 1.400,
    comment   = "basics for scintilla lpeg lexer for context/metafun; contains copyrighted code from mitchell.att.foicica.com",
    author    = "Hans Hagen, PRAGMA-ADE, Hasselt NL",
    copyright = "PRAGMA ADE / ConTeXt Development Team",
    license   = "see context related readme files",
}

-- There is some history behind these lexers. When LPEG came around, we immediately adopted it in CONTEXT
-- and one of the first things to show up were the verbatim plugins. There we have several models: line based
-- and syntax based. The way we visualize the syntax for TEX, METAPOST and LUA relates closely to the way the
-- CONTEXT user interface evolved. We have LPEG all over the place.
--
-- When at some point it became possible to have an LPEG lexer in SCITE (by using the TEXTADEPT dll) I figured
-- out a mix of what we had and what is needed there. The lexers that came with the dll were quite slow so in
-- order to deal with the large \LUA\ data files I rewrote the lexing so that it did work with the dll but was
-- useable otherwise too. There are quite some comments in the older files that explain these steps. However, it
-- never became pretty and didn't always look the way I wanted (read: more in tune with how we use LUA in
-- CONTEXT). Over time the plugin evolved and the code was adapted (to some extent it became more like what we
-- already had) but when SCITE moved to version 5 (as part of a C++ update) and the dll changed again, it became
-- clear that we had to come up with a different approach. Not only did the dll have to be kept in sync, but we
-- also had to keep adapting interfaces. When SCITE changed to a new lexer framework some of the properties setup
-- changed, but after adapting that it still failed to load. I noticed some new directory scanning in the dll
-- code which probably interferes with the way we load. (I probably need to look into that but adapting the
-- directory structure and adding some cheats is not what I like to do.)
--
-- The original plan was to have TEXTADEPT as fallback but at the pace it was evolving it was not something we
-- could use yet. Because it was meant to be configurable we even had a stripped down interface defined, tuned
-- for CONTEXT related document processing. After all it is good to have a fallback in case SCITE fails. But
-- keeping up with the changing interfaces made clear that it was not really meant for this (replacing components
-- is hard, and I assume it's more about adding stuff to the shipped editor), and more and more features are not
-- what we need: editors quickly become too loaded with confusing features that make no sense when editing
-- documents. We need something that is easy to use for novice (and occasional) users and SCITE has always been
-- perfect for that. The nice thing about TEXTADEPT is that it supports more platforms, the nice thing about
-- SCITE is that it is stable and small. I understand that the interplay between scintilla, lexilla and lexlpeg
-- is subtle, but because of that using it generically (other than in textadept) is hard.
--
-- So, the question was: how to proceed. The main component missing in SCITE's LUA interface is LPEG. By adding
-- that, plus a few bytewise styler helpers, I was able to use the lexers without the dll. The advantage of using
-- the built in methods is that we (1) can use the same LUA instance that other scripts use, (2) have access to
-- all kind of properties, (3) can have a cleaner implementation (for loading), (4) can make the code look
-- better. In retrospect I should have done that long ago. In the end it turned out that the new implementation
-- is just as fast but also more memory efficient (the dll could occasionally crash on many open files, and
-- loading many files when restarting was pretty slow too, probably because of excessive immediate lexing).
--
-- It will take a while to strip out all the artifacts needed for the dll based lexer but we'll get there. Because
-- we also supported the regular lexers that came with the dll some keys got the names needed there, but that no
-- longer makes sense: we can use the built-in SCITE lexers for those. One of the things that is gone is the
-- whitespace trickery: we always lex the whole document, as we already did most of the time (the only possible
-- gain is when one is at the end of a document, and there we observed side effects of not enough backtracking).
--
-- I will keep the old files archived so we can always use the (optimized) helpers from those if we ever need
-- them. I could go back to the code we had before the dll came around but it makes no sense, so for now I just
-- pruned and rewrote. The lexer definitions are still such that we could load other lexers but that compatibility
-- has now been dropped so I might clean up that bit too. It's not that hard to write additional lexers if I need
-- them.
--
-- We assume at least LUA 5.3 now (tests with LUA 5.4 demonstrated a 10% performance gain). I will also make a
-- helper module that has all the nice CONTEXT functions available. Logging to file is gone because in SCITE we
-- can write to the output pane. Actually: I'm still waiting for scite to overload that output pane lexer.
--
-- As mentioned, the dll based lexer uses whitespace to determine where to start and then only lexes what comes
-- after it. In the mixed lexing that we use that hardly makes sense, because editing before the end still needs
-- to backtrack. The question then becomes if we really save runtime. Also, we can now be nested inside nested
-- lexers, which never worked well before. We also use only one theme, so there is no need to be more clever. We
-- no longer keep the styles in a lexer, simply because we use a consistent set and have plenty of styles in
-- SCITE now.
--
-- The previous versions had way more code because we could also load the lexers shipped with the dll, and had
-- quite some optimizations and caching for older dll's and SCITE limitations, so the real tricks are in those
-- old files.
--
-- We now can avoid the intermediate tables in SCITE and only use them when we lex in CONTEXT. So in the end
-- we're back where we started more than a decade ago. It's a pity that we dropped TEXTADEPT support but it was
-- simply too hard to keep up. So be it. Maybe some day ... after all we still have the old code.
--
-- We had the lexers namespace plus additional tables and functions in the lexers.context namespace in order not
-- to overload 'original' functionality, but the context subtable could go away.
--
-- Performance: I decided to go for whole document lexing every time, which is fast enough for what we want. If a
-- file is very (!) large one can always choose the "none" lexer in the interface. The advantage of whole parsing
-- is that it is more robust than wild guessing based on whitespace (which can fail occasionally), that we are
-- less likely to crash after being in the editor for a whole day, and that preamble scanning etc is now more
-- reliable. If needed I can figure out some gain (but a new and faster machine makes more sense). There is
-- optional partial document lexing (under testing). In any case, the former delay from slowly loading many
-- documents at startup is gone now (somehow it looked like all tabs were lexed when a document was opened).
local global = _G

local lpeg  = require("lpeg")

if lpeg.setmaxstack then lpeg.setmaxstack(1000) end

local gmatch, match, lower, upper, gsub, format = string.gmatch, string.match, string.lower, string.upper, string.gsub, string.format
local concat, sort = table.concat, table.sort
local type, next, setmetatable, tostring = type, next, setmetatable, tostring
local R, P, S, C, Cp, Ct, Cmt, Cc, Cf, Cg, Cs = lpeg.R, lpeg.P, lpeg.S, lpeg.C, lpeg.Cp, lpeg.Ct, lpeg.Cmt, lpeg.Cc, lpeg.Cf, lpeg.Cg, lpeg.Cs
local lpegmatch = lpeg.match

local usage    = resolvers and "context" or "scite"
local trace    = false
local collapse = false -- can save some 15% (maybe easier on scintilla)

local lexers     = { }
local styles     = { }
local numbers    = { }
local helpers    = { }
local patterns   = { }
local usedlexers = { }

lexers.usage     = usage

lexers.helpers   = helpers
lexers.styles    = styles
lexers.numbers   = numbers
lexers.patterns  = patterns

-- Maybe at some point I will just load the basic mtx toolkit, which gives a lot of benefits, but for now we
-- make do with poor man's copies.
--
-- Some basic reporting.

local report = logs and logs.reporter("scite lpeg lexer") or function(fmt,str,...)
    if str then
        fmt = format(fmt,str,...)
    end
    print(format("scite lpeg lexer > %s",fmt))
end

report("loading context lexer module")

lexers.report = report

local function sortedkeys(hash) -- simple version, good enough for here
    local t, n = { }, 0
    for k, v in next, hash do
        t[#t+1] = k
        local l = #tostring(k)
        if l > n then
            n = l
        end
    end
    sort(t)
    return t, n
end

helpers.sortedkeys = sortedkeys
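
-- A quick usage sketch: sortedkeys returns the sorted key list plus the width
-- of the longest (stringified) key, which is handy for aligned reporting:
--
-- local keys, width = sortedkeys { alpha = true, beta = true }
-- -- keys comes out as { "alpha", "beta" }, width as 5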

-- begin of patterns (we should take them from l-lpeg.lua)

do

    local anything             = P(1)
    local idtoken              = R("az","AZ","\127\255","__")
    local digit                = R("09")
    local sign                 = S("+-")
    local period               = P(".")
    local octdigit             = R("07")
    local hexdigit             = R("09","AF","af")
    local lower                = R("az")
    local upper                = R("AZ")
    local alpha                = upper + lower
    local space                = S(" \n\r\t\f\v")
    local eol                  = S("\r\n")
    local backslash            = P("\\")
    local decimal              = digit^1
    local octal                = P("0")
                               * octdigit^1
    local hexadecimal          = P("0") * S("xX")
                               * (hexdigit^0 * period * hexdigit^1 + hexdigit^1 * period * hexdigit^0 + hexdigit^1)
                               * (S("pP") * sign^-1 * hexdigit^1)^-1 -- *
    local integer              = sign^-1
                               * (hexadecimal + octal + decimal)
    local float                = sign^-1
                               * (digit^0 * period * digit^1 + digit^1 * period * digit^0 + digit^1)
                               * S("eE") * sign^-1 * digit^1 -- *

    patterns.idtoken           = idtoken
    patterns.digit             = digit
    patterns.sign              = sign
    patterns.period            = period
    patterns.octdigit          = octdigit
    patterns.hexdigit          = hexdigit
    patterns.ascii             = R("\000\127") -- useless
    patterns.extend            = R("\000\255") -- useless
    patterns.control           = R("\000\031")
    patterns.lower             = lower
    patterns.upper             = upper
    patterns.alpha             = alpha
    patterns.decimal           = decimal
    patterns.octal             = octal
    patterns.hexadecimal       = hexadecimal
    patterns.float             = float
    patterns.cardinal          = decimal

    local utf8next             = R("\128\191")

    patterns.utf8next          = utf8next
    patterns.utf8one           = R("\000\127")
    patterns.utf8two           = R("\194\223") * utf8next
    patterns.utf8three         = R("\224\239") * utf8next * utf8next
    patterns.utf8four          = R("\240\244") * utf8next * utf8next * utf8next

    patterns.signeddecimal     = sign^-1 * decimal
    patterns.signedoctal       = sign^-1 * octal
    patterns.signedhexadecimal = sign^-1 * hexadecimal
    patterns.integer           = integer
    patterns.real              =
        sign^-1 * (                    -- at most one
            digit^1 * period * digit^0 -- 10.0 10.
          + digit^0 * period * digit^1 -- 0.10 .10
          + digit^1                    -- 10
       )

    patterns.anything          = anything
    patterns.any               = anything
    patterns.restofline        = (1-eol)^1
    patterns.space             = space
    patterns.spacing           = space^1
    patterns.nospacing         = (1-space)^1
    patterns.eol               = eol
    patterns.newline           = P("\r\n") + eol
    patterns.backslash         = backslash

    local endof                = S("\n\r\f")

    patterns.startofline       = P(function(input,index)
        return (index == 1 or lpegmatch(endof,input,index-1)) and index
    end)

end
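
-- A quick usage sketch: these are plain lpeg patterns, so lpeg.match returns
-- the position after a match (or nil), for instance:
--
-- print(lpegmatch(patterns.float,   "1.25e-3")) -- 8
-- print(lpegmatch(patterns.integer, "0x1A"))    -- 5
-- print(lpegmatch(patterns.float,   "10"))      -- nil (exponent is mandatory)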

do

    local char     = string.char
    local byte     = string.byte
    local format   = format

    local function utfchar(n)
        if n < 0x80 then
            return char(n)
        elseif n < 0x800 then
            return char(
                0xC0 + (n//0x00040),
                0x80 +  n           % 0x40
            )
        elseif n < 0x10000 then
            return char(
                0xE0 + (n//0x01000),
                0x80 + (n//0x00040) % 0x40,
                0x80 +  n           % 0x40
            )
        elseif n < 0x40000 then
            return char(
                0xF0 + (n//0x40000),
                0x80 + (n//0x01000),
                0x80 + (n//0x00040) % 0x40,
                0x80 +  n           % 0x40
            )
        else
         -- return char(
         --     0xF1 + (n//0x1000000),
         --     0x80 + (n//0x0040000),
         --     0x80 + (n//0x0001000),
         --     0x80 + (n//0x0000040) % 0x40,
         --     0x80 +  n             % 0x40
         -- )
            return "?"
        end
    end

    helpers.utfchar = utfchar

    local utf8next         = R("\128\191")
    local utf8one          = R("\000\127")
    local utf8two          = R("\194\223") * utf8next
    local utf8three        = R("\224\239") * utf8next * utf8next
    local utf8four         = R("\240\244") * utf8next * utf8next * utf8next

    helpers.utf8one   = utf8one
    helpers.utf8two   = utf8two
    helpers.utf8three = utf8three
    helpers.utf8four  = utf8four

    local utfidentifier    = utf8two + utf8three + utf8four
    helpers.utfidentifier  = (R("AZ","az","__")      + utfidentifier)
                           * (R("AZ","az","__","09") + utfidentifier)^0

    helpers.utfcharpattern = P(1) * utf8next^0 -- unchecked but fast
    helpers.utfbytepattern = utf8one   / byte
                           + utf8two   / function(s) local c1, c2         = byte(s,1,2) return   c1 * 64 + c2                       -    12416 end
                           + utf8three / function(s) local c1, c2, c3     = byte(s,1,3) return  (c1 * 64 + c2) * 64 + c3            -   925824 end
                           + utf8four  / function(s) local c1, c2, c3, c4 = byte(s,1,4) return ((c1 * 64 + c2) * 64 + c3) * 64 + c4 - 63447168 end
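
    -- A quick sanity sketch: utfchar and utfbytepattern are inverses, e.g.
    -- utfchar(0x00A0) gives "\194\160" (two bytes), and for that string the
    -- two byte decoder computes 194 * 64 + 160 - 12416 = 0x00A0 again; the
    -- magic offsets fold the 0xC0/0x80 prefix bits out of the sum.
    --
    -- print(lpegmatch(helpers.utfbytepattern,utfchar(0x00A0))) -- 160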

    local p_false          = P(false)
    local p_true           = P(true)

    local function make(t)
        local function making(t)
            local p    = p_false
            local keys = sortedkeys(t)
            for i=1,#keys do
                local k = keys[i]
                if k ~= "" then
                    local v = t[k]
                    if v == true then
                        p = p + P(k) * p_true
                    elseif v == false then
                        -- can't happen
                    else
                        p = p + P(k) * making(v)
                    end
                end
            end
            if t[""] then
                p = p + p_true
            end
            return p
        end
        local p    = p_false
        local keys = sortedkeys(t)
        for i=1,#keys do
            local k = keys[i]
            if k ~= "" then
                local v = t[k]
                if v == true then
                    p = p + P(k) * p_true
                elseif v == false then
                    -- can't happen
                else
                    p = p + P(k) * making(v)
                end
            end
        end
        return p
    end

    local function collapse(t,x)
        if type(t) ~= "table" then
            return t, x
        else
            local n = next(t)
            if n == nil then
                return t, x
            elseif next(t,n) == nil then
                -- one entry
                local k = n
                local v = t[k]
                if type(v) == "table" then
                    return collapse(v,x..k)
                else
                    return v, x .. k
                end
            else
                local tt = { }
                for k, v in next, t do
                    local vv, kk = collapse(v,k)
                    tt[kk] = vv
                end
                return tt, x
            end
        end
    end

    function helpers.utfchartabletopattern(list)
        local tree = { }
        local n = #list
        if n == 0 then
            for s in next, list do
                local t = tree
                local p, pk
                for c in gmatch(s,".") do
                    if t == true then
                        t = { [c] = true, [""] = true }
                        p[pk] = t
                        p = t
                        t = false
                    elseif t == false then
                        t = { [c] = false }
                        p[pk] = t
                        p = t
                        t = false
                    else
                        local tc = t[c]
                        if not tc then
                            tc = false
                            t[c] = false
                        end
                        p = t
                        t = tc
                    end
                    pk = c
                end
                if t == false then
                    p[pk] = true
                elseif t == true then
                    -- okay
                else
                    t[""] = true
                end
            end
        else
            for i=1,n do
                local s = list[i]
                local t = tree
                local p, pk
                for c in gmatch(s,".") do
                    if t == true then
                        t = { [c] = true, [""] = true }
                        p[pk] = t
                        p = t
                        t = false
                    elseif t == false then
                        t = { [c] = false }
                        p[pk] = t
                        p = t
                        t = false
                    else
                        local tc = t[c]
                        if not tc then
                            tc = false
                            t[c] = false
                        end
                        p = t
                        t = tc
                    end
                    pk = c
                end
                if t == false then
                    p[pk] = true
                elseif t == true then
                    -- okay
                else
                    t[""] = true
                end
            end
        end
        collapse(tree,"")
        return make(tree)
    end
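
    -- A quick usage sketch: the tree approach shares common prefixes, and
    -- because longer branches are tried before the empty "word end" marker
    -- the longest alternative wins:
    --
    -- local p = helpers.utfchartabletopattern { "foo", "foobar", "bar" }
    --
    -- lpegmatch(p,"foobar") -- 7 (the longer word)
    -- lpegmatch(p,"food")   -- 4 (just "foo")
    -- lpegmatch(p,"baz")    -- nil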

    patterns.invisibles = helpers.utfchartabletopattern {
        utfchar(0x00A0), -- nbsp
        utfchar(0x2000), -- enquad
        utfchar(0x2001), -- emquad
        utfchar(0x2002), -- enspace
        utfchar(0x2003), -- emspace
        utfchar(0x2004), -- threeperemspace
        utfchar(0x2005), -- fourperemspace
        utfchar(0x2006), -- sixperemspace
        utfchar(0x2007), -- figurespace
        utfchar(0x2008), -- punctuationspace
        utfchar(0x2009), -- breakablethinspace
        utfchar(0x200A), -- hairspace
        utfchar(0x200B), -- zerowidthspace
        utfchar(0x202F), -- narrownobreakspace
        utfchar(0x205F), -- math thinspace
        utfchar(0x200C), -- zwnj
        utfchar(0x200D), -- zwj
    }

    -- now we can make:

    patterns.wordtoken    = R("az","AZ","\127\255")
    patterns.wordpattern  = patterns.wordtoken^3 -- todo: if limit and #s < limit then

    patterns.iwordtoken   = patterns.wordtoken - patterns.invisibles
    patterns.iwordpattern = patterns.iwordtoken^3

end

-- end of patterns

-- begin of scite properties

-- Because we use a limited number of lexers we can provide a new whitespace on demand. If needed
-- we can recycle from a pool or we can just not reuse a lexer and load anew. I'll deal with that
-- when the need is there. At that moment I might as well start working with nested tables (so that
-- we have a language tree).

local whitespace = function() return "whitespace" end

local maxstyle    = 127 -- otherwise negative values in editor object -- 255
local nesting     = 0
local style_main  = 0
local style_white = 0

if usage == "scite" then

    local names = { }
    local props = { }
    local count = 1

    -- styles 32 .. 39 are reserved; we want to avoid holes so we preset:

    for i=0,maxstyle do
        numbers[i] = "default"
    end

    whitespace = function()
        return style_main -- "mainspace"
    end

    function lexers.loadtheme(theme)
        styles = theme or { }
        for k, v in next, styles do
            names[#names+1] = k
        end
        sort(names)
        for i=1,#names do
            local name = names[i]
            styles[name].n = count
            numbers[name] = count
            numbers[count] = name
            if count == 31 then
                count = 40
            else
                count = count + 1
            end
        end
        for i=1,#names do
            local t = { }
            local s = styles[names[i]]
            local n = s.n
            local fore = s.fore
            local back = s.back
            local font = s.font
            local size = s.size
            local bold = s.bold
            if fore then
                if #fore == 1 then
                    t[#t+1] = format("fore:#%02X%02X%02X",fore[1],fore[1],fore[1])
                elseif #fore == 3 then
                    t[#t+1] = format("fore:#%02X%02X%02X",fore[1],fore[2],fore[3])
                end
            end
            if back then
                if #back == 1 then
                    t[#t+1] = format("back:#%02X%02X%02X",back[1],back[1],back[1])
                elseif #back == 3 then
                    t[#t+1] = format("back:#%02X%02X%02X",back[1],back[2],back[3])
                else
                    t[#t+1] = "back:#000000"
                end
            end
            if bold then
                t[#t+1] = "bold"
            end
            if font then
                t[#t+1] = format("font:%s",font)
            end
            if size then
                t[#t+1] = format("size:%s",size)
            end
            if #t > 0 then
                props[n] = concat(t,",")
            end
        end
        setmetatable(styles, {
            __index =
                function(target,name)
                    if name then
                        count = count + 1
                        if count > maxstyle then
                            count = maxstyle
                        end
                        numbers[name] = count
                        local style = { n = count }
                        target[name] = style
                        return style
                    end
                end
        } )
        lexers.styles  = styles
        lexers.numbers = numbers

        style_main  = styles.mainspace.n
        style_white = styles.whitespace.n
    end

    function lexers.registertheme(properties,name)
        for n, p in next, props do
            local tag = "style.script_" .. name .. "." .. n
            properties[tag] = p
        end
    end

end

-- end of scite properties

-- begin of word matchers

do

  -- function patterns.exactmatch(words,case_insensitive)
  --     local characters = concat(words)
  --     local pattern = S(characters) + patterns.idtoken
  --     if case_insensitive then
  --         pattern = pattern + S(upper(characters)) + S(lower(characters))
  --     end
  --     if case_insensitive then
  --         local list = { }
  --         if #words == 0 then
  --             for k, v in next, words do
  --                 list[lower(k)] = v
  --             end
  --         else
  --             for i=1,#words do
  --                 list[lower(words[i])] = true
  --             end
  --         end
  --         return Cmt(pattern^1, function(_,i,s)
  --             return list[lower(s)] -- and i or nil
  --         end)
  --     else
  --         local list = { }
  --         if #words == 0 then
  --             for k, v in next, words do
  --                 list[k] = v
  --             end
  --         else
  --             for i=1,#words do
  --                 list[words[i]] = true
  --             end
  --         end
  --         return Cmt(pattern^1, function(_,i,s)
  --             return list[s] -- and i or nil
  --         end)
  --     end
  -- end
  --
  -- function patterns.justmatch(words)
  --     local p = P(words[1])
  --     for i=2,#words do
  --         p = p + P(words[i])
  --     end
  --     return p
  -- end

    -- we could do camelcase but that is not what users use for keywords

    local p_finish = #(1 - R("az","AZ","__"))

    patterns.finishmatch = p_finish

    function patterns.exactmatch(words,ignorecase)
        local list = { }
        if ignorecase then
            if #words == 0 then
                for k, v in next, words do
                    list[lower(k)] = v
                end
            else
                for i=1,#words do
                    list[lower(words[i])] = true
                end
            end
        else
            if #words == 0 then
                for k, v in next, words do
                    list[k] = v
                end
            else
                for i=1,#words do
                    list[words[i]] = true
                end
            end
        end
        return helpers.utfchartabletopattern(list) * p_finish
    end

    patterns.justmatch = patterns.exactmatch

end
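
-- A quick usage sketch: exactmatch builds a keyword matcher that refuses to
-- match mid-identifier, thanks to the p_finish lookahead:
--
-- local p = patterns.exactmatch { "begin", "end" }
--
-- lpegmatch(p,"begin ") -- 6
-- lpegmatch(p,"begins") -- nil (the identifier continues)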

-- end of word matchers

-- begin of loaders

do

    local cache = { }

    function lexers.loadluafile(name)
        local okay, data = pcall(require, name)
        if okay and data then
            if trace then
                report("lua file '%s' has been loaded",name)
            end
            return data, name
        end
        if trace then
            report("unable to load lua file '%s'",name)
        end
    end

    function lexers.loaddefinitions(name)
        local data = cache[name]
        if data then
            if trace then
                report("reusing definitions '%s'",name)
            end
            return data
        elseif trace and data == false then
            report("definitions '%s' were not found",name)
        end
        local okay, data = pcall(require, name)
        if not okay or not data then
            report("unable to load definition file '%s'",name)
            data = false
        elseif trace then
            report("definition file '%s' has been loaded",name)
        end
        cache[name] = data
        return type(data) == "table" and data
    end

end
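
-- A hedged usage sketch (the definition file name is just an example):
-- definition files are plain Lua modules that return a table, so something
-- like
--
-- local definitions = lexers.loaddefinitions("scite-context-data-tex")
--
-- either returns that table (cached on subsequent calls) or false.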

-- end of loaders

-- begin of spell checking (todo: pick files from distribution instead)

do

    -- spell checking (we can only load lua files)
    --
    -- return {
    --     min   = 3,
    --     max   = 40,
    --     n     = 12345,
    --     words = {
    --         ["someword"]    = "someword",
    --         ["anotherword"] = "Anotherword",
    --     },
    -- }

    local lists    = { }
    local disabled = false

    function lexers.disablewordcheck()
        disabled = true
    end

    function lexers.setwordlist(tag,limit) -- returns hash (lowercase keys and original values)
        if not tag or tag == "" then
            return false, 3
        end
        local list = lists[tag]
        if not list then
            list = lexers.loaddefinitions("spell-" .. tag)
            if not list or type(list) ~= "table" then
                report("invalid spell checking list for '%s'",tag)
                list = { words = false, min = 3 }
            else
                list.words = list.words or false
                list.min   = list.min or 3
            end
            lists[tag] = list
        end
        if trace then
            report("enabling spell checking for '%s' with minimum '%s'",tag,list.min)
        end
        return list.words, list.min
    end

    if usage ~= "scite" then

        function lexers.styleofword(validwords,validminimum,s,p)
            if not validwords or #s < validminimum then
                return "text", p
            else
                -- keys are lower
                local word = validwords[s]
                if word == s then
                    return "okay", p -- exact match
                elseif word then
                    return "warning", p -- case issue
                else
                    local word = validwords[lower(s)]
                    if word == s then
                        return "okay", p -- exact match
                    elseif word then
                        return "warning", p -- case issue
                    elseif upper(s) == s then
                        return "warning", p -- probably a logo or acronym
                    else
                        return "error", p
                    end
                end
            end
        end

    end

end

-- end of spell checking

-- begin lexer management

lexers.structured = false
-- lexers.structured = true -- the future for the typesetting end

do

    function lexers.new(name,filename)
        if not filename then
            filename = false
        end
        local lexer = {
            name       = name,
            filename   = filename,
            whitespace = whitespace()
        }
        if trace then
            report("initializing lexer tagged '%s' from file '%s'",name,filename or name)
        end
        return lexer
    end

    if usage == "scite" then

        -- overloaded later

        function lexers.token(name, pattern)
            local s = styles[name] -- always something anyway
            return pattern * Cc(s and s.n or 32) * Cp()
        end

    else

        function lexers.token(name, pattern)
            return pattern * Cc(name) * Cp()
        end

    end

    -- todo: variant that directly styles

    local function append(pattern,step)
        if not step then
            return pattern
        elseif pattern then
            return pattern + P(step)
        else
            return P(step)
        end
    end

    local function prepend(pattern,step)
        if not step then
            return pattern
        elseif pattern then
            return P(step) + pattern
        else
            return P(step)
        end
    end

    local wrapup = usage == "scite" and
        function(name,pattern)
            return pattern
        end
    or
        function(name,pattern,nested)
            if lexers.structured then
                return Cf ( Ct("") * Cg(Cc("name") * Cc(name)) * Cg(Cc("data") * Ct(pattern)), rawset)
            elseif nested then
                return pattern
            else
                return Ct (pattern)
            end
        end

    local function construct(namespace,lexer,level)
        if lexer then
            local rules    = lexer.rules
            local embedded = lexer.embedded
            local grammar  = nil
            if embedded then
                for i=1,#embedded do
                    local embed = embedded[i]
                    local done  = embed.done
                    if not done then
                        local lexer = embed.lexer
                        local start = embed.start
                        local stop  = embed.stop
                        if usage == "scite" then
                            start = start / function() nesting = nesting + 1 end
                            stop  = stop  / function() nesting = nesting - 1 end
                        end
                        if trace then
                            start = start / function() report("    nested lexer %s: start",lexer.name) end
                            stop  = stop  / function() report("    nested lexer %s: stop", lexer.name) end
                        end
                        done = start * (construct(namespace,lexer,level+1) - stop)^0 * stop
                        done = wrapup(lexer.name,done,true)
                    end
               -- grammar = prepend(grammar, done)
                  grammar = append(grammar, done)
                end
            end
            if rules then
                for i=1,#rules do
                    grammar = append(grammar,rules[i][2])
                end
            end
            return grammar
        end
    end

    function lexers.load(filename,namespace)
        if not namespace then
            namespace = filename
        end
        local lexer = usedlexers[namespace] -- we load by filename but the internal name can be short
        if lexer then
            if trace then
                report("reusing lexer '%s'",namespace)
            end
            return lexer
        elseif trace then
            report("loading lexer '%s' from '%s'",namespace,filename)
        end
        local lexer, name = lexers.loadluafile(filename)
        if not lexer then
            report("invalid lexer file '%s'",filename)
            return lexers.new(filename)
        elseif type(lexer) ~= "table" then
            if trace then
                report("lexer file '%s' gets a dummy lexer",filename)
            end
            return lexers.new(filename)
        end
        local grammar = construct(namespace,lexer,1)
        if grammar then
            grammar = wrapup(namespace,grammar^0)
            lexer.grammar = grammar
        end
        --
        local backtracker = lexer.backtracker
        local foretracker = lexer.foretracker
        if backtracker then
            local start    = 1
            local position = 1
            local pattern  = (Cmt(Cs(backtracker),function(s,p,m) if p > start then return #s else position = p - #m end end) + P(1))^1
            lexer.backtracker = function(str,offset)
                position = 1
                start    = offset
                lpegmatch(pattern,str,1)
                return position
            end
        end
        if foretracker then
            local start    = 1
            local position = 1
            local pattern  = (Cmt(Cs(foretracker),function(s,p,m) position = p - #m return #s end) + P(1))^1
            lexer.foretracker = function(str,offset)
                position = offset
                start    = offset
                lpegmatch(pattern,str,position)
                return position
            end
        end
        --
        usedlexers[filename] = lexer
        return lexer
    end
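
    -- A hedged sketch of the backtracker hook wired up above: a lexer can set
    -- lexer.backtracker (and/or lexer.foretracker) to an lpeg pattern that
    -- matches safe restart points, for instance (hypothetical, for a TEX like
    -- lexer):
    --
    -- lexer.backtracker = P("\\starttext") + P("\\stoptext")
    --
    -- after loading, lexer.backtracker(str,offset) returns the position where
    -- the last such match ending at or before offset starts, which scite_range
    -- uses to find a sane point to restart partial lexing.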

    function lexers.embed(parent, embed, start, stop, rest)
        local embedded = parent.embedded
        if not embedded then
            embedded        = { }
            parent.embedded = embedded
        end
        embedded[#embedded+1] = {
            lexer = embed,
            start = start,
            stop  = stop,
            rest  = rest,
        }
    end

end
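
-- A hedged sketch of what a lexer module can look like; the names and rules
-- are made up but the shape is what construct and lexers.load expect:
--
-- local lexer   = lexers.new("demo","scite-context-lexer-demo")
-- local space   = lexers.token(lexer.whitespace, patterns.spacing)
-- local comment = lexers.token("comment", P("%") * (1 - patterns.eol)^0)
-- local rest    = lexers.token("default", patterns.nospacing)
--
-- lexer.rules = {
--     { "whitespace", space   },
--     { "comment",    comment },
--     { "rest",       rest    },
-- }
--
-- return lexer
--
-- A child language is hooked in with lexers.embed(parent,child,start,stop),
-- where start and stop are tokens that enter and leave the embedded language.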

-- end lexer management

-- This will become a configurable option (whole document lexing is more
-- reliable but it can be slow on those 5 megabyte lua files):

-- begin of context typesetting lexer

if usage ~= "scite" then

    local function collapsed(t)
        local lasttoken = nil
        local lastindex = nil
        for i=1,#t,2 do
            local token    = t[i]
            local position = t[i+1]
            if token == lasttoken then
                t[lastindex] = position
            elseif lastindex then
                lastindex = lastindex + 1
                t[lastindex] = token
                lastindex = lastindex + 1
                t[lastindex] = position
                lasttoken = token
            else
                lastindex = i+1
                lasttoken = token
            end
        end
        if lastindex then
            for i=#t,lastindex+1,-1 do
                t[i] = nil
            end
        end
        return t
    end
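
    -- A quick sketch of what collapsed does to the flat token/position list
    -- that the grammar produces: successive equal tokens are merged and the
    -- last position wins, so
    --
    -- collapsed { "text", 10, "text", 14, "key", 20 }
    --
    -- yields { "text", 14, "key", 20 }.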

    function lexers.lex(lexer,text) -- get rid of init_style
        local grammar = lexer.grammar
        if grammar then
            nesting = 0
            if trace then
                report("lexing '%s' string with length %i",lexer.name,#text)
            end
            local t = lpegmatch(grammar,text)
            if collapse then
                t = collapsed(t)
            end
            return t
        else
            return { }
        end
    end

end

-- end of context typesetting lexer

-- begin of scite editor lexer

if usage == "scite" then

    -- For char-def.lua we need some 0.55 s with Lua 5.3 and 10% less with Lua 5.4 (timed on a 2013
    -- Dell precision with i7-3840QM). That test file has 271540 lines of Lua (table) code and is
    -- 5.312.665 bytes large (dd 2021.09.29). The three methods perform about the same but the more
    -- direct approach saves some tables. Using the new Lua garbage collector makes no difference.
    --
    -- We can actually integrate folding in here if we want but it might become messy as we then
    -- also need to deal with specific newlines. We can also (in scite) store some extra state wrt
    -- the language used.
    --
    -- Operating on a range (as in the past) is faster when editing very large documents but we
    -- don't do that often. The problem is that backtracking over whitespace is tricky for some
    -- nested lexers.

    local editor       = false
    local startstyling = false   -- editor:StartStyling(position,style)
    local setstyling   = false   -- editor:SetStyling(slice,style)
    local getlevelat   = false   -- editor.StyleAt[position] or StyleAt(editor,position)
    local getlineat    = false
    local thestyleat   = false   -- editor.StyleAt[position]
    local thelevelat   = false

    local styleoffset  = 1
    local foldoffset   = 0

    local function seteditor(usededitor)
        editor       = usededitor
        startstyling = editor.StartStyling
        setstyling   = editor.SetStyling
        getlevelat   = editor.FoldLevel        -- GetLevelAt
        getlineat    = editor.LineFromPosition
        thestyleat   = editor.StyleAt
        thelevelat   = editor.FoldLevel        -- SetLevelAt
    end

    function lexers.token(style, pattern)
        if type(style) ~= "number" then
            style = styles[style] -- always something anyway
            style = style and style.n or 32
        end
        return pattern * Cp() / function(p)
            local n = p - styleoffset
            if nesting > 0 and style == style_main then
                setstyling(editor,n,style_white)
            else
                setstyling(editor,n,style)
            end
            styleoffset = styleoffset + n
        end
    end

    -- used in: tex txt xml

    function lexers.styleofword(validwords,validminimum,s,p)
        local style
        if not validwords or #s < validminimum then
            style = numbers.text
        else
            -- keys are lower
            local word = validwords[s]
            if word == s then
                style = numbers.okay -- exact match
            elseif word then
                style = numbers.warning -- case issue
            else
                local word = validwords[lower(s)]
                if word == s then
                    style = numbers.okay -- exact match
                elseif word then
                    style = numbers.warning -- case issue
                elseif upper(s) == s then
                    style = numbers.warning -- probably a logo or acronym
                else
                    style = numbers.error
                end
            end
        end
        local n = p - styleoffset
        setstyling(editor,n,style)
        styleoffset = styleoffset + n
    end

    -- when we have an embedded language we cannot rely on the range that
    -- scite provides because we need to look further

    -- it looks like scite starts before the cursor / insert

    local function scite_range(lexer,size,start,length,partial) -- set editor
        if partial then
            local backtracker = lexer.backtracker
            local foretracker = lexer.foretracker
            if start == 0 and size == length then
                -- see end
            elseif (backtracker or foretracker) and start > 0 then
                local snippet = editor:textrange(0,size)
                if size ~= length then
                    -- only lstart matters, the rest is statistics; we operate on 1-based strings
                    local lstart = backtracker and backtracker(snippet,start+1) or 0
                    local lstop  = foretracker and foretracker(snippet,start+1+length) or size
                    if lstart > 0 then
                        lstart = lstart - 1
                    end
                    if lstop > size then
                        lstop = size - 1
                    end
                    local stop    = start + length
                    local back    = start - lstart
                    local fore    = lstop - stop
                    local llength = lstop - lstart + 1
                 -- snippet = string.sub(snippet,lstart+1,lstop+1) -- we can return the initial position in the lpegmatch
                 -- return back, fore, lstart, llength, snippet, lstart + 1
                    return back, fore, 0, llength, snippet, lstart + 1
                else
                    return 0, 0, 0, size, snippet, 1
                end
            else
                -- still not entirely okay (nested mp)
                local stop   = start + length
                local lstart = start
                local lstop  = stop
                while lstart > 0 do
                    if thestyleat[lstart] == style_main then
                        break
                    else
                        lstart = lstart - 1
                    end
                end
                if lstart < 0 then
                    lstart = 0
                end
                while lstop < size do
                    if thestyleat[lstop] == style_main then
                        break
                    else
                        lstop = lstop + 1
                    end
                end
                if lstop > size then
                    lstop = size
                end
                local back    = start - lstart
                local fore    = lstop - stop
                local llength = lstop - lstart + 1
                local snippet = editor:textrange(lstart,lstop)
                if llength > #snippet then
                    llength = #snippet
                end
                return back, fore, lstart, llength, snippet, 1
            end
        end
        local snippet = editor:textrange(0,size)
        return 0, 0, 0, size, snippet, 1
    end

    local function scite_lex(lexer,text,offset,initial)
        local grammar = lexer.grammar
        if grammar then
            styleoffset = 1
            nesting     = 0
            startstyling(editor,offset,32)
            local preamble = lexer.preamble
            if preamble then
                lpegmatch(preamble,offset == 0 and text or editor:textrange(0,500))
            end
            lpegmatch(grammar,text,initial)
        end
    end

    -- We can assume sane definitions, that is: most languages use similar constructs for the start
    -- and end of something. So we don't need to waste much time on nested lexers.

    local newline           = patterns.newline

    local scite_fold_base   = SC_FOLDLEVELBASE       or 0
    local scite_fold_header = SC_FOLDLEVELHEADERFLAG or 0
    local scite_fold_white  = SC_FOLDLEVELWHITEFLAG  or 0
    local scite_fold_number = SC_FOLDLEVELNUMBERMASK or 0

    local function styletonumbers(folding,hash)
        if not hash then
            hash = { }
        end
        if folding then
            for k, v in next, folding do
                local s = hash[k] or { }
                for k, v in next, v do
                    local n = numbers[k]
                    if n then
                        s[n] = v
                    end
                end
                hash[k] = s
            end
        end
        return hash
    end
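
    -- A hedged sketch of the folding table a lexer can provide: keys are the
    -- strings to watch for, values map style names to level deltas (which
    -- styletonumbers translates to style numbers), for instance:
    --
    -- lexer.folding = {
    --     ["{"] = { operator =  1 },
    --     ["}"] = { operator = -1 },
    -- }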

    local folders = setmetatable({ }, { __index = function(t, lexer)
        local folder  = false
        local folding = lexer.folding
        if folding then
            local foldmapping = styletonumbers(folding)
            local embedded    = lexer.embedded
            if embedded then
                for i=1,#embedded do
                    local embed = embedded[i]
                    local lexer = embed.lexer
                    if lexer then
                        foldmapping = styletonumbers(lexer.folding,foldmapping)
                    end
                end
            end
            local foldpattern = helpers.utfchartabletopattern(foldmapping)
            local resetparser = lexer.resetparser
            local line        = 0
            local current     = scite_fold_base
            local previous    = scite_fold_base
            --
            foldpattern = Cp() * (foldpattern/foldmapping) / function(s,match)
                if match then
                    local l = match[thestyleat[s + foldoffset - 1]]
                    if l then
                        current = current + l
                    end
                end
            end
            local action_yes = function()
                if current > previous then
                    previous = previous | scite_fold_header
                elseif current < scite_fold_base then
                    current = scite_fold_base
                end
                thelevelat[line] = previous
                previous = current
                line = line + 1
            end
            local action_nop = function()
                previous = previous | scite_fold_white
                thelevelat[line] = previous
                previous = current
                line = line + 1
            end
            --
            foldpattern = ((foldpattern + (1-newline))^1 * newline/action_yes + newline/action_nop)^0
            --
            folder = function(text,offset,initial)
                if resetparser then
                    resetparser()
                end
                foldoffset = offset
                nesting    = 0
                --
                previous   = scite_fold_base -- & scite_fold_number
                if foldoffset == 0 then
                    line = 0
                else
                    line = getlineat(editor,offset) & scite_fold_number -- scite is at the beginning of a line
                 -- previous = getlevelat(editor,line) -- alas
                    previous = thelevelat[line] -- zero/one
                end
                current = previous
                lpegmatch(foldpattern,text,initial)
            end
        else
            folder = function() end
        end
        t[lexer] = folder
        return folder
    end } )

    -- can somehow be called twice (idem for the lexer)

    local function scite_fold(lexer,text,offset,initial)
        if text ~= "" then
            return folders[lexer](text,offset,initial)
        end
    end

    -- We cannot use the styler style setters so we use the editor ones. This has to do with the fact
    -- that the styler sees the (utf) encoding while we are doing bytes. There is also some initial
    -- skipping over characters. First versions used those callers and had to offset by -2, but while
    -- that works with whole document lexing it doesn't work with partial lexing (one can also get
    -- multiple OnStyle calls per edit).
    --
    -- The backtracking here relates to the fact that we start at the outer lexer (otherwise embedded
    -- lexers can have occasional side effects). It also makes it possible to do better syntax checking
    -- on the fly (some day).
    --
    -- The (old) editor:textrange cannot handle nul characters. If that doesn't get patched in scite we
    -- need to use the styler variant (which is not in scite).

    -- lexer    : context lexer
    -- editor   : scite editor object (needs checking every update)
    -- language : scite lexer language id
    -- filename : current file
    -- size     : size of current file
    -- start    : first position where to edit
    -- length   : length of the stripe to edit
    -- trace    : flag that signals tracing

    -- After quite some experiments with the styler methods I settled on the editor methods because
    -- these are not sensitive to utf and have no side effects like the two forward cursor positions.

    function lexers.scite_onstyle(lexer,editor,partial,language,filename,size,start,length,trace)
        seteditor(editor)
        local clock   = trace and os.clock()
        local back, fore, lstart, llength, snippet, initial = scite_range(lexer,size,start,length,partial)
        if clock then
            report("lexing %s", language)
            report("  document file : %s", filename)
            report("  document size : %i", size)
            report("  styler start  : %i", start)
            report("  styler length : %i", length)
            report("  backtracking  : %i", back)
            report("  foretracking  : %i", fore)
            report("  lexer start   : %i", lstart)
            report("  lexer length  : %i", llength)
            report("  text length   : %i", #snippet)
            report("  lexing method : %s", partial and "partial" or "whole")
            report("  after copying : %0.3f seconds",os.clock()-clock)
        end
        scite_lex(lexer,snippet,lstart,initial)
        if clock then
            report("  after lexing  : %0.3f seconds",os.clock()-clock)
        end
        scite_fold(lexer,snippet,lstart,initial)
        if clock then
            report("  after folding : %0.3f seconds",os.clock()-clock)
        end
    end

end

-- end of scite editor lexer

lexers.context = lexers -- for now

return lexers
1384