local info = {
    version   = 1.400,
    comment   = "basics for scintilla lpeg lexer for context/metafun, contains copyrighted code from mitchell.att.foicica.com",
    author    = "Hans Hagen, PRAGMA-ADE, Hasselt NL",
    copyright = "PRAGMA ADE / ConTeXt Development Team",
    license   = "see context related readme files",
}

-- There is some history behind these lexers. When LPEG came around, we immediately adopted that in CONTEXT
-- and one of the first things to show up were the verbatim plugins. There we have several models: line based
-- and syntax based. The way we visualize the syntax for TEX, METAPOST and LUA relates closely to the way the
-- CONTEXT user interface evolved. We have LPEG all over the place.
--
-- When at some point it became possible to have an LPEG lexer in SCITE (by using the TEXTADEPT dll) I figured
-- out a mix of what we had and what is needed there. The lexers that came with the dll were quite slow so in
-- order to deal with the large LUA data files I rewrote the lexing so that it did work with the dll but was
-- useable otherwise too. There are quite some comments in the older files that explain these steps. However, it
-- never became pretty and didn't always look the way I wanted (read: more in tune with how we use LUA in
-- CONTEXT). Over time the plugin evolved and the code was adapted (to some extent it became more like we already
-- had) but when SCITE moved to version 5 (as part of a C++ update) and the dll again changed it became clear
-- that we had to come up with a different approach. Not only did the dll have to be kept in sync, but we also
-- had to keep adapting interfaces. When SCITE changed to a new lexer framework some of the properties setup
-- changed but after adapting that it still failed to load. I noticed some new directory scanning in the dll code
-- which probably interferes with the way we load. (I probably need to look into that but adapting the directory
-- structure and adding some cheats is not what I like to do.)
--
-- The original plan was to have TEXTADEPT as fallback but at the pace it was evolving it was not something we
-- could use yet. Because it was meant to be configurable we even had a stripped down interface defined, tuned
-- for CONTEXT related document processing. After all it is good to have a fallback in case SCITE fails. But
-- keeping up with the changing interfaces made clear that it was not really meant for this (replacing components
-- is hard and I assume it's more about adding stuff to the shipped editor). And more and more features is not
-- what we need: editors quickly become too loaded with confusing features that make no sense when editing
-- documents. We need something that is easy to use for novice (and occasional) users and SCITE has always been
-- perfect for that. The nice thing about TEXTADEPT is that it supports more platforms, the nice thing about
-- SCITE is that it is stable and small. I understand that the interplay between scintilla, lexilla and lexlpeg
-- is subtle, but because of that using it generically (other than in TEXTADEPT) is hard.
--
-- So, the question was: how to proceed. The main component missing in SCITE's LUA interface is LPEG. By adding
-- that, plus a few bytewise styler helpers, I was able to use the lexers without the dll. The advantage of using
-- the built-in methods is that we (1) can use the same LUA instance that other scripts use, (2) have access to
-- all kinds of properties, (3) can have a cleaner implementation (for loading), (4) can make the code look
-- better. In retrospect I should have done that long ago. In the end it turned out that the new implementation
-- is just as fast but also more memory efficient (the dll could occasionally crash on many open files, and
-- loading many files when restarting was pretty slow too, probably because of excessive immediate lexing).
--
-- It will take a while to strip out all the artifacts needed for the dll based lexer but we'll get there. Because
-- we also supported the regular lexers that came with the dll some keys got the names needed there, but that no
-- longer makes sense: we can use the built-in SCITE lexers for those. One of the things that is gone is the
-- whitespace trickery: we always lex the whole document, as we already did most of the time (the only possible
-- gain is when one is at the end of a document, and there we observed side effects of not enough backtracking).
--
-- I will keep the old files archived so we can always use the (optimized) helpers from those if we ever need
-- them. I could go back to the code we had before the dll came around but it makes no sense, so for now I just
-- pruned and rewrote. The lexer definitions are still such that we could load other lexers but that compatibility
-- has now been dropped so I might clean up that bit too. It's not that hard to write additional lexers if I need
-- them.
--
-- We assume at least LUA 5.3 now (tests with LUA 5.4 demonstrated a 10% performance gain). I will also make a
-- helper module that has all the nice CONTEXT functions available. Logging to file is gone because in SCITE we
-- can write to the output pane. Actually: I'm still waiting for SCITE to overload that output pane lexer.
--
-- As mentioned, the dll based lexer uses whitespace to determine where to start and then only lexes what comes
-- after it. In the mixed lexing that we use that hardly makes sense, because editing before the end still needs
-- to backtrack. The question then becomes if we really save runtime. Also, we can now nest lexers inside nested
-- lexers, which never worked well before. We also use one theme, so there is no need to be more clever. We no
-- longer keep the styles in a lexer simply because we use a consistent set and have plenty of styles in SCITE now.
--
-- The previous versions had way more code because we also could load the lexers shipped with the dll, and had
-- quite some optimizations and caching for older dll's and SCITE limitations, so the real tricks are in those
-- old files.
--
-- We now can avoid the intermediate tables in SCITE and only use them when we lex in CONTEXT. So in the end we're
-- back where we started more than a decade ago. It's a pity that we dropped TEXTADEPT support but it was simply
-- too hard to keep up. So be it. Maybe some day ... after all we still have the old code.
--
-- We had the lexers namespace plus additional tables and functions in the lexers.context namespace in order not
-- to overload 'original' functionality, but the context subtable could go away.
--
-- Performance: I decided to go for whole document lexing every time, which is fast enough for what we want. If a
-- file is very (!) large one can always choose the "none" lexer in the interface. The advantage of whole parsing
-- is that it is more robust than wild guessing based on whitespace (which can fail occasionally), that we are
-- less likely to crash after being in the editor for a whole day, and that preamble scanning etc is now more
-- reliable. If needed I can figure out some gain (but a new and faster machine makes more sense). There is
-- optional partial document lexing (under testing). In any case, the former delay caused by slowly loading many
-- documents at startup is gone now (somehow it looked like all tabs were lexed when a document was opened).

local global = _G

local lpeg  = require("lpeg")

if lpeg.setmaxstack then lpeg.setmaxstack(1000) end

local gmatch, match, lower, upper, gsub, format = string.gmatch, string.match, string.lower, string.upper, string.gsub, string.format
local concat, sort = table.concat, table.sort
local type, next, setmetatable, tostring = type, next, setmetatable, tostring
local R, P, S, C, Cp, Ct, Cmt, Cc, Cf, Cg, Cs = lpeg.R, lpeg.P, lpeg.S, lpeg.C, lpeg.Cp, lpeg.Ct, lpeg.Cmt, lpeg.Cc, lpeg.Cf, lpeg.Cg, lpeg.Cs
local lpegmatch = lpeg.match

local usage    = resolvers and "context" or "scite"
local trace    = false
local collapse = false -- can save some 15% (maybe easier on scintilla)

local lexers     = { }
local styles     = { }
local numbers    = { }
local helpers    = { }
local patterns   = { }
local usedlexers = { }

lexers.usage     = usage

lexers.helpers   = helpers
lexers.styles    = styles
lexers.numbers   = numbers
lexers.patterns  = patterns

-- Maybe at some point I will just load the basic mtx toolkit, which gives a lot of benefits, but for now
-- we make do with poor man's copies.
--
-- Some basic reporting.

local report = logs and logs.reporter("scite lpeg lexer") or function(fmt,str,...)
    if str then
        fmt = format(fmt,str,...)
    end
    print(format("scite lpeg lexer > %s",fmt))
end

report("loading context lexer module")

lexers.report = report

local function sortedkeys(hash) -- simple version, good enough for here
    local t, n = { }, 0
    for k, v in next, hash do
        t[#t+1] = k
        local l = #tostring(k)
        if l > n then
            n = l
        end
    end
    sort(t)
    return t, n
end

helpers.sortedkeys = sortedkeys

-- begin of patterns (we should take them from l-lpeg.lua)

do

    local anything             = P(1)
    local idtoken              = R("az","AZ","\127\255","__")
    local digit                = R("09")
    local sign                 = S("+-")
    local period               = P(".")
    local octdigit             = R("07")
    local hexdigit             = R("09","AF","af")
    local lower                = R("az")
    local upper                = R("AZ")
    local alpha                = upper + lower
    local space                = S(" \n\r\t\f\v")
    local eol                  = S("\r\n")
    local backslash            = P("\\")
    local decimal              = digit^1
    local octal                = P("0")
                               * octdigit^1
    local hexadecimal          = P("0") * S("xX")
                               * (hexdigit^0 * period * hexdigit^1 + hexdigit^1 * period * hexdigit^0 + hexdigit^1)
                               * (S("pP") * sign^-1 * hexdigit^1)^-1 -- *
    local integer              = sign^-1
                               * (hexadecimal + octal + decimal)
    local float                = sign^-1
                               * (digit^0 * period * digit^1 + digit^1 * period * digit^0 + digit^1)
                               * S("eE") * sign^-1 * digit^1 -- *

    patterns.idtoken           = idtoken
    patterns.digit             = digit
    patterns.sign              = sign
    patterns.period            = period
    patterns.octdigit          = octdigit
    patterns.hexdigit          = hexdigit
    patterns.ascii             = R("\000\127") -- useless
    patterns.extend            = R("\000\255") -- useless
    patterns.control           = R("\000\031")
    patterns.lower             = lower
    patterns.upper             = upper
    patterns.alpha             = alpha
    patterns.decimal           = decimal
    patterns.octal             = octal
    patterns.hexadecimal       = hexadecimal
    patterns.float             = float
    patterns.cardinal          = decimal

    patterns.signeddecimal     = sign^-1 * decimal
    patterns.signedoctal       = sign^-1 * octal
    patterns.signedhexadecimal = sign^-1 * hexadecimal
    patterns.integer           = integer
    patterns.real              =
        sign^-1 * (                    -- at most one
            digit^1 * period * digit^0 -- 10.0 10.
          + digit^0 * period * digit^1 -- 0.10 .10
          + digit^1                    -- 10
       )

    patterns.anything          = anything
    patterns.any               = anything
    patterns.restofline        = (1-eol)^1
    patterns.space             = space
    patterns.spacing           = space^1
    patterns.nospacing         = (1-space)^1
    patterns.eol               = eol
    patterns.newline           = P("\r\n") + eol
    patterns.backslash         = backslash

    local endof                = S("\n\r\f")

    patterns.startofline       = P(function(input,index)
        return (index == 1 or lpegmatch(endof,input,index-1)) and index
    end)

end
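
-- Just an illustration (not used in this file) of what these number patterns do;
-- lpegmatch returns the position after the match:
--
--   lpegmatch(patterns.integer, "0x1A2B")  -- 7 (all six characters matched)
--   lpegmatch(patterns.float,   "1.25e-3") -- 8 (the exponent is mandatory here)
--   lpegmatch(patterns.real,    ".10")     -- 4 (at most one period)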

do

    local char     = string.char
    local byte     = string.byte
    local format   = format

    local function utfchar(n)
        if n < 0x80 then
            return char(n)
        elseif n < 0x800 then
            return char(
                0xC0 + (n//0x00040),
                0x80 +  n           % 0x40
            )
        elseif n < 0x10000 then
            return char(
                0xE0 + (n//0x01000),
                0x80 + (n//0x00040) % 0x40,
                0x80 +  n           % 0x40
            )
        elseif n < 0x40000 then
            return char(
                0xF0 + (n//0x40000),
                0x80 + (n//0x01000),
                0x80 + (n//0x00040) % 0x40,
                0x80 +  n           % 0x40
            )
        else
         -- return char(
         --     0xF1 + (n//0x1000000),
         --     0x80 + (n//0x0040000),
         --     0x80 + (n//0x0001000),
         --     0x80 + (n//0x0000040) % 0x40,
         --     0x80 +  n             % 0x40
         -- )
            return "?"
        end
    end

    helpers.utfchar = utfchar
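
    -- For example (an illustration; these code points also show up in the invisibles
    -- list below):
    --
    --   utfchar(0x00A0) -- "\194\160"     : 0xC2 0xA0, the utf-8 encoding of nbsp
    --   utfchar(0x2001) -- "\226\128\129" : 0xE2 0x80 0x81, the utf-8 encoding of emquad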

    local utf8next         = R("\128\191")
    local utf8one          = R("\000\127")
    local utf8two          = R("\194\223") * utf8next
    local utf8three        = R("\224\239") * utf8next * utf8next
    local utf8four         = R("\240\244") * utf8next * utf8next * utf8next

    local utfidentifier    = utf8two + utf8three + utf8four
    helpers.utfidentifier  = (R("AZ","az","__")      + utfidentifier)
                           * (R("AZ","az","__","09") + utfidentifier)^0

    helpers.utfcharpattern = P(1) * utf8next^0 -- unchecked but fast
    helpers.utfbytepattern = utf8one   / byte
                           + utf8two   / function(s) local c1, c2         = byte(s,1,2) return   c1 * 64 + c2                       -    12416 end
                           + utf8three / function(s) local c1, c2, c3     = byte(s,1,3) return  (c1 * 64 + c2) * 64 + c3            -   925824 end
                           + utf8four  / function(s) local c1, c2, c3, c4 = byte(s,1,4) return ((c1 * 64 + c2) * 64 + c3) * 64 + c4 - 63447168 end
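
    -- Going the other way, the magic constants above just remove the utf-8 tag bits,
    -- so (again just an illustration):
    --
    --   lpegmatch(helpers.utfbytepattern, "\194\160")     -- 0x00A0
    --   lpegmatch(helpers.utfbytepattern, "\226\128\129") -- 0x2001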

    local p_false          = P(false)
    local p_true           = P(true)

    local function make(t)
        local function making(t)
            local p    = p_false
            local keys = sortedkeys(t)
            for i=1,#keys do
                local k = keys[i]
                if k ~= "" then
                    local v = t[k]
                    if v == true then
                        p = p + P(k) * p_true
                    elseif v == false then
                        -- can't happen
                    else
                        p = p + P(k) * making(v)
                    end
                end
            end
            if t[""] then
                p = p + p_true
            end
            return p
        end
        local p    = p_false
        local keys = sortedkeys(t)
        for i=1,#keys do
            local k = keys[i]
            if k ~= "" then
                local v = t[k]
                if v == true then
                    p = p + P(k) * p_true
                elseif v == false then
                    -- can't happen
                else
                    p = p + P(k) * making(v)
                end
            end
        end
        return p
    end

    local function collapse(t,x)
        if type(t) ~= "table" then
            return t, x
        else
            local n = next(t)
            if n == nil then
                return t, x
            elseif next(t,n) == nil then
                -- one entry
                local k = n
                local v = t[k]
                if type(v) == "table" then
                    return collapse(v,x..k)
                else
                    return v, x .. k
                end
            else
                local tt = { }
                for k, v in next, t do
                    local vv, kk = collapse(v,k)
                    tt[kk] = vv
                end
                return tt, x
            end
        end
    end

    function helpers.utfchartabletopattern(list)
        local tree = { }
        local n = #list
        if n == 0 then
            for s in next, list do
                local t = tree
                local p, pk
                for c in gmatch(s,".") do
                    if t == true then
                        t = { [c] = true, [""] = true }
                        p[pk] = t
                        p = t
                        t = false
                    elseif t == false then
                        t = { [c] = false }
                        p[pk] = t
                        p = t
                        t = false
                    else
                        local tc = t[c]
                        if not tc then
                            tc = false
                            t[c] = false
                        end
                        p = t
                        t = tc
                    end
                    pk = c
                end
                if t == false then
                    p[pk] = true
                elseif t == true then
                    -- okay
                else
                    t[""] = true
                end
            end
        else
            for i=1,n do
                local s = list[i]
                local t = tree
                local p, pk
                for c in gmatch(s,".") do
                    if t == true then
                        t = { [c] = true, [""] = true }
                        p[pk] = t
                        p = t
                        t = false
                    elseif t == false then
                        t = { [c] = false }
                        p[pk] = t
                        p = t
                        t = false
                    else
                        local tc = t[c]
                        if not tc then
                            tc = false
                            t[c] = false
                        end
                        p = t
                        t = tc
                    end
                    pk = c
                end
                if t == false then
                    p[pk] = true
                elseif t == true then
                    -- okay
                else
                    t[""] = true
                end
            end
        end
        collapse(tree,"")
        return make(tree)
    end
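
    -- As an illustration: a list like { "foo", "fonts", "fee" } becomes a character
    -- tree { f = { o = { o = true, ... }, e = { e = true } } } and from that we get
    -- a pattern that matches any of the three words, so lpegmatch on "fonts" returns
    -- position 6.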

    patterns.invisibles = helpers.utfchartabletopattern {
        utfchar(0x00A0), -- nbsp
        utfchar(0x2000), -- enquad
        utfchar(0x2001), -- emquad
        utfchar(0x2002), -- enspace
        utfchar(0x2003), -- emspace
        utfchar(0x2004), -- threeperemspace
        utfchar(0x2005), -- fourperemspace
        utfchar(0x2006), -- sixperemspace
        utfchar(0x2007), -- figurespace
        utfchar(0x2008), -- punctuationspace
        utfchar(0x2009), -- breakablethinspace
        utfchar(0x200A), -- hairspace
        utfchar(0x200B), -- zerowidthspace
        utfchar(0x202F), -- narrownobreakspace
        utfchar(0x205F), -- math thinspace
    }

    -- now we can make:

    patterns.wordtoken    = R("az","AZ","\127\255")
    patterns.wordpattern  = patterns.wordtoken^3 -- todo: if limit and #s < limit then

    patterns.iwordtoken   = patterns.wordtoken - patterns.invisibles
    patterns.iwordpattern = patterns.iwordtoken^3

end

-- end of patterns

-- begin of scite properties

-- Because we use a limited number of lexers we can provide a new whitespace on demand. If needed
-- we can recycle from a pool or we can just not reuse a lexer and load anew. I'll deal with that
-- when the need is there. At that moment I might as well start working with nested tables (so
-- that we have a language tree).

local whitespace = function() return "whitespace" end

local maxstyle    = 127 -- otherwise negative values in editor object -- 255
local nesting     = 0
local style_main  = 0
local style_white = 0

if usage == "scite" then

    local names = { }
    local props = { }
    local count = 1

    -- styles 32 .. 39 are reserved; we want to avoid holes so we preset:

    for i=0,maxstyle do
        numbers[i] = "default"
    end

    whitespace = function()
        return style_main -- "mainspace"
    end

    function lexers.loadtheme(theme)
        styles = theme or { }
        for k, v in next, styles do
            names[#names+1] = k
        end
        sort(names)
        for i=1,#names do
            local name = names[i]
            styles[name].n = count
            numbers[name] = count
            numbers[count] = name
            if count == 31 then
                count = 40
            else
                count = count + 1
            end
        end
        for i=1,#names do
            local t = { }
            local s = styles[names[i]]
            local n = s.n
            local fore = s.fore
            local back = s.back
            local font = s.font
            local size = s.size
            local bold = s.bold
            if fore then
                if #fore == 1 then
                    t[#t+1] = format("fore:#%02X%02X%02X",fore[1],fore[1],fore[1])
                elseif #fore == 3 then
                    t[#t+1] = format("fore:#%02X%02X%02X",fore[1],fore[2],fore[3])
                end
            end
            if back then
                if #back == 1 then
                    t[#t+1] = format("back:#%02X%02X%02X",back[1],back[1],back[1])
                elseif #back == 3 then
                    t[#t+1] = format("back:#%02X%02X%02X",back[1],back[2],back[3])
                else
                    t[#t+1] = "back:#000000"
                end
            end
            if bold then
                t[#t+1] = "bold"
            end
            if font then
                t[#t+1] = format("font:%s",font)
            end
            if size then
                t[#t+1] = format("size:%s",size)
            end
            if #t > 0 then
                props[n] = concat(t,",")
            end
        end
        setmetatable(styles, {
            __index =
                function(target,name)
                    if name then
                        count = count + 1
                        if count > maxstyle then
                            count = maxstyle
                        end
                        numbers[name] = count
                        local style = { n = count }
                        target[name] = style
                        return style
                    end
                end
        } )
        lexers.styles  = styles
        lexers.numbers = numbers

        style_main  = styles.mainspace.n
        style_white = styles.whitespace.n
    end

    function lexers.registertheme(properties,name)
        for n, p in next, props do
            local tag = "style.script_" .. name .. "." .. n
            properties[tag] = p
        end
    end

end

-- end of scite properties

-- begin of word matchers

do

  -- function patterns.exactmatch(words,case_insensitive)
  --     local characters = concat(words)
  --     local pattern = S(characters) + patterns.idtoken
  --     if case_insensitive then
  --         pattern = pattern + S(upper(characters)) + S(lower(characters))
  --     end
  --     if case_insensitive then
  --         local list = { }
  --         if #words == 0 then
  --             for k, v in next, words do
  --                 list[lower(k)] = v
  --             end
  --         else
  --             for i=1,#words do
  --                 list[lower(words[i])] = true
  --             end
  --         end
  --         return Cmt(pattern^1, function(_,i,s)
  --             return list[lower(s)] -- and i or nil
  --         end)
  --     else
  --         local list = { }
  --         if #words == 0 then
  --             for k, v in next, words do
  --                 list[k] = v
  --             end
  --         else
  --             for i=1,#words do
  --                 list[words[i]] = true
  --             end
  --         end
  --         return Cmt(pattern^1, function(_,i,s)
  --             return list[s] -- and i or nil
  --         end)
  --     end
  -- end
  --
  -- function patterns.justmatch(words)
  --     local p = P(words[1])
  --     for i=2,#words do
  --         p = p + P(words[i])
  --     end
  --     return p
  -- end

    -- we could do camelcase but that is not what users use for keywords

    local p_finish = #(1 - R("az","AZ","__"))

    patterns.finishmatch = p_finish

    function patterns.exactmatch(words,ignorecase)
        local list = { }
        if ignorecase then
            if #words == 0 then
                for k, v in next, words do
                    list[lower(k)] = v
                end
            else
                for i=1,#words do
                    list[lower(words[i])] = true
                end
            end
        else
            if #words == 0 then
                for k, v in next, words do
                    list[k] = v
                end
            else
                for i=1,#words do
                    list[words[i]] = true
                end
            end
        end
        return helpers.utfchartabletopattern(list) * p_finish
    end

    patterns.justmatch = patterns.exactmatch
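
    -- Typical usage in a lexer definition looks like (an illustration; the keyword
    -- list of course depends on the language at hand):
    --
    --   local keyword = patterns.exactmatch { "if", "then", "else", "end" }
    --
    -- where the p_finish lookahead makes sure that we match "end " but not "ending":
    --
    --   lpegmatch(keyword, "end ")   -- 4
    --   lpegmatch(keyword, "ending") -- fails (nil)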

end

-- end of word matchers

-- begin of loaders

do

    local cache = { }

    function lexers.loadluafile(name)
        local okay, data = pcall(require, name)
        if okay and data then
            if trace then
                report("lua file '%s' has been loaded",name)
            end
            return data, name
        end
        if trace then
            report("unable to load lua file '%s'",name)
        end
    end

    function lexers.loaddefinitions(name)
        local data = cache[name]
        if data then
            if trace then
                report("reusing definitions '%s'",name)
            end
            return data
        elseif trace and data == false then
            report("definitions '%s' were not found",name)
        end
        local okay, data = pcall(require, name)
        if not okay or not data then
            report("unable to load definition file '%s'",name)
            data = false
        elseif trace then
            report("definition file '%s' has been loaded",name)
        end
        cache[name] = data
        return type(data) == "table" and data
    end

end

-- end of loaders

-- begin of spell checking (todo: pick files from distribution instead)

do

    -- spell checking (we can only load lua files)
    --
    -- return {
    --     min   = 3,
    --     max   = 40,
    --     n     = 12345,
    --     words = {
    --         ["someword"]    = "someword",
    --         ["anotherword"] = "Anotherword",
    --     },
    -- }

    local lists    = { }
    local disabled = false

    function lexers.disablewordcheck()
        disabled = true
    end

    function lexers.setwordlist(tag,limit) -- returns hash (lowercase keys and original values)
        if not tag or tag == "" then
            return false, 3
        end
        local list = lists[tag]
        if not list then
            list = lexers.loaddefinitions("spell-" .. tag)
            if not list or type(list) ~= "table" then
                report("invalid spell checking list for '%s'",tag)
                list = { words = false, min = 3 }
            else
                list.words = list.words or false
                list.min   = list.min or 3
            end
            lists[tag] = list
        end
        if trace then
            report("enabling spell checking for '%s' with minimum '%s'",tag,list.min)
        end
        return list.words, list.min
    end
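
    -- Typical usage in a lexer (an illustration; the "en" tag assumes that a file
    -- spell-en.lua with the above layout can be found on the module path):
    --
    --   local validwords, validminimum = lexers.setwordlist("en")
    --
    -- The returned hash has lowercase keys with the original (cased) words as values,
    -- which is what styleofword below expects.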

    if usage ~= "scite" then

        function lexers.styleofword(validwords,validminimum,s,p)
            if not validwords or #s < validminimum then
                return "text", p
            else
                -- keys are lower
                local word = validwords[s]
                if word == s then
                    return "okay", p -- exact match
                elseif word then
                    return "warning", p -- case issue
                else
                    local word = validwords[lower(s)]
                    if word == s then
                        return "okay", p -- exact match
                    elseif word then
                        return "warning", p -- case issue
                    elseif upper(s) == s then
                        return "warning", p -- probably a logo or acronym
                    else
                        return "error", p
                    end
                end
            end
        end

    end

end

-- end of spell checking

-- begin lexer management

lexers.structured = false
-- lexers.structured = true -- the future for the typesetting end

do

    function lexers.new(name,filename)
        if not filename then
            filename = false
        end
        local lexer = {
            name       = name,
            filename   = filename,
            whitespace = whitespace()
        }
        if trace then
            report("initializing lexer tagged '%s' from file '%s'",name,filename or name)
        end
        return lexer
    end

    if usage == "scite" then

        -- overloaded later

        function lexers.token(name, pattern)
            local s = styles[name] -- always something anyway
            return pattern * Cc(s and s.n or 32) * Cp()
        end

    else

        function lexers.token(name, pattern)
            return pattern * Cc(name) * Cp()
        end

    end

    -- todo: variant that directly styles

    local function append(pattern,step)
        if not step then
            return pattern
        elseif pattern then
            return pattern + P(step)
        else
            return P(step)
        end
    end

    local function prepend(pattern,step)
        if not step then
            return pattern
        elseif pattern then
            return P(step) + pattern
        else
            return P(step)
        end
    end

    local wrapup = usage == "scite" and
        function(name,pattern)
            return pattern
        end
    or
        function(name,pattern,nested)
            if lexers.structured then
                return Cf ( Ct("") * Cg(Cc("name") * Cc(name)) * Cg(Cc("data") * Ct(pattern)), rawset)
            elseif nested then
                return pattern
            else
                return Ct (pattern)
            end
        end
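
    -- With lexers.structured enabled the result of lexing is no longer one flat
    -- { token, position, ... } array but a tree of { name = ..., data = ... } tables,
    -- one per (nested) lexer; a sketch of what that could look like:
    --
    --   {
    --       name = "tex",
    --       data = { "command", 8, "text", 14, { name = "metafun", data = { ... } } }
    --   }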

    local function construct(namespace,lexer,level)
        if lexer then
            local rules    = lexer.rules
            local embedded = lexer.embedded
            local grammar  = nil
            if embedded then
                for i=1,#embedded do
                    local embed = embedded[i]
                    local done  = embed.done
                    if not done then
                        local lexer = embed.lexer
                        local start = embed.start
                        local stop  = embed.stop
                        if usage == "scite" then
                            start = start / function() nesting = nesting + 1 end
                            stop  = stop  / function() nesting = nesting - 1 end
                        end
                        if trace then
                            start = start / function() report("    nested lexer %s: start",lexer.name) end
                            stop  = stop  / function() report("    nested lexer %s: stop", lexer.name) end
                        end
                        done = start * (construct(namespace,lexer,level+1) - stop)^0 * stop
                        done = wrapup(lexer.name,done,true)
                    end
                 -- grammar = prepend(grammar, done)
                    grammar = append(grammar, done)
                end
            end
            if rules then
                for i=1,#rules do
                    grammar = append(grammar,rules[i][2])
                end
            end
            return grammar
        end
    end

    function lexers.load(filename,namespace)
        if not namespace then
            namespace = filename
        end
        local lexer = usedlexers[namespace] -- we load by filename but the internal name can be short
        if lexer then
            if trace then
                report("reusing lexer '%s'",namespace)
            end
            return lexer
        elseif trace then
            report("loading lexer '%s' from '%s'",namespace,filename)
        end
        local lexer, name = lexers.loadluafile(filename)
        if not lexer then
            report("invalid lexer file '%s'",filename)
            return lexers.new(filename)
        elseif type(lexer) ~= "table" then
            if trace then
                report("lexer file '%s' gets a dummy lexer",filename)
            end
            return lexers.new(filename)
        end
        local grammar = construct(namespace,lexer,1)
        if grammar then
            grammar = wrapup(namespace,grammar^0)
            lexer.grammar = grammar
        end
        --
        local backtracker = lexer.backtracker
        local foretracker = lexer.foretracker
        if backtracker then
            local start    = 1
            local position = 1
            local pattern  = (Cmt(Cs(backtracker),function(s,p,m) if p > start then return #s else position = p - #m end end) + P(1))^1
            lexer.backtracker = function(str,offset)
                position = 1
                start    = offset
                lpegmatch(pattern,str,1)
                return position
            end
        end
        if foretracker then
            local start    = 1
            local position = 1
            local pattern  = (Cmt(Cs(foretracker),function(s,p,m) position = p - #m return #s end) + P(1))^1
            lexer.foretracker = function(str,offset)
                position = offset
                start    = offset
                lpegmatch(pattern,str,position)
                return position
            end
        end
        --
        usedlexers[filename] = lexer
        return lexer
    end

    function lexers.embed(parent, embed, start, stop, rest)
        local embedded = parent.embedded
        if not embedded then
            embedded        = { }
            parent.embedded = embedded
        end
        embedded[#embedded+1] = {
            lexer = embed,
            start = start,
            stop  = stop,
            rest  = rest,
        }
    end
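
    -- As an illustration (a sketch only: real lexers pass token patterns as start and
    -- stop, and the lexer filenames differ per distribution):
    --
    --   local texlexer = lexers.load("scite-context-lexer-tex")
    --   local mpslexer = lexers.load("scite-context-lexer-mps")
    --
    --   lexers.embed(texlexer, mpslexer, P("\\startMPcode"), P("\\stopMPcode"))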

end

-- end lexer management

-- This will become a configurable option (whole is more reliable but it can
-- be slow on those 5 megabyte lua files):

-- begin of context typesetting lexer

if usage ~= "scite" then

    local function collapsed(t)
        local lasttoken = nil
        local lastindex = nil
        for i=1,#t,2 do
            local token    = t[i]
            local position = t[i+1]
            if token == lasttoken then
                t[lastindex] = position
            elseif lastindex then
                lastindex = lastindex + 1
                t[lastindex] = token
                lastindex = lastindex + 1
                t[lastindex] = position
                lasttoken = token
            else
                lastindex = i+1
                lasttoken = token
            end
        end
        if lastindex then -- empty input gives no lastindex
            for i=#t,lastindex+1,-1 do
                t[i] = nil
            end
        end
        return t
    end
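
    -- An illustration of what collapsing does to the flat token/position array:
    --
    --   { "comment", 10, "comment", 12, "text", 20 }
    --
    -- becomes
    --
    --   { "comment", 12, "text", 20 }
    --
    -- so successive runs of the same token merge into one longer run.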

    function lexers.lex(lexer,text) -- get rid of init_style
        local grammar = lexer.grammar
        if grammar then
            nesting = 0
            if trace then
                report("lexing '%s' string with length %i",lexer.name,#text)
            end
            local t = lpegmatch(grammar,text)
            if collapse then
                t = collapsed(t)
            end
            return t
        else
            return { }
        end
    end

end

-- end of context typesetting lexer

-- begin of scite editor lexer

if usage == "scite" then

    -- For char-def.lua we need some 0.55 s with Lua 5.3 and 10% less with Lua 5.4 (timed on a 2013
    -- Dell precision with i7-3840QM). That test file has 271540 lines of Lua (table) code and is
    -- 5.312.665 bytes large (dd 2021.09.29). The three methods perform about the same but the more
    -- direct approach saves some tables. Using the new Lua garbage collector makes no difference.
    --
    -- We can actually integrate folding in here if we want but it might become messy as we then
    -- also need to deal with specific newlines. We can also (in scite) store some extra state wrt
    -- the language used.
    --
    -- Operating on a range (as in the past) is faster when editing very large documents but we
    -- don't do that often. The problem is that backtracking over whitespace is tricky for some
    -- nested lexers.

    local editor       = false
    local startstyling = false   -- editor:StartStyling(position,style)
    local setstyling   = false   -- editor:SetStyling(slice,style)
    local getlevelat   = false   -- editor.StyleAt[position] or StyleAt(editor,position)
    local getlineat    = false
    local thestyleat   = false   -- editor.StyleAt[position]
    local thelevelat   = false

    local styleoffset  = 1
    local foldoffset   = 0

    local function seteditor(usededitor)
        editor       = usededitor
        startstyling = editor.StartStyling
        setstyling   = editor.SetStyling
        getlevelat   = editor.FoldLevel        -- GetLevelAt
        getlineat    = editor.LineFromPosition
        thestyleat   = editor.StyleAt
        thelevelat   = editor.FoldLevel        -- SetLevelAt
    end

    function lexers.token(style, pattern)
        if type(style) ~= "number" then
            style = styles[style] -- always something anyway
            style = style and style.n or 32
        end
        return pattern * Cp() / function(p)
            local n = p - styleoffset
            local s = style
            if nesting > 0 and s == style_main then
                s = style_white -- inside a nested lexer main space shows up as white space
            end
            setstyling(editor,n,s)
            styleoffset = styleoffset + n
        end
    end

    -- used in: tex txt xml

    function lexers.styleofword(validwords,validminimum,s,p)
        local style
        if not validwords or #s < validminimum then
            style = numbers.text
        else
            -- keys are lower
            local word = validwords[s]
            if word == s then
                style = numbers.okay -- exact match
            elseif word then
                style = numbers.warning -- case issue
            else
                local word = validwords[lower(s)]
                if word == s then
                    style = numbers.okay -- exact match
                elseif word then
                    style = numbers.warning -- case issue
                elseif upper(s) == s then
                    style = numbers.warning -- probably a logo or acronym
                else
                    style = numbers.error
                end
            end
        end
        local n = p - styleoffset
        setstyling(editor,n,style)
        styleoffset = styleoffset + n
    end

    -- when we have an embedded language we cannot rely on the range that
    -- scite provides because we need to look further

    -- it looks like scite starts before the cursor / insert

    local function scite_range(lexer,size,start,length,partial) -- set editor
        if partial then
            local backtracker = lexer.backtracker
            local foretracker = lexer.foretracker
            if start == 0 and size == length then
                -- see end
            elseif (backtracker or foretracker) and start > 0 then
                local snippet = editor:textrange(0,size)
                if size ~= length then
                    -- only lstart matters, the rest is statistics; we operate on 1-based strings
                    local lstart = backtracker and backtracker(snippet,start+1) or 0
                    local lstop  = foretracker and foretracker(snippet,start+1+length) or size
                    if lstart > 0 then
                        lstart = lstart - 1
                    end
                    if lstop > size then
                        lstop = size - 1
                    end
                    local stop    = start + length
                    local back    = start - lstart
                    local fore    = lstop - stop
                    local llength = lstop - lstart + 1
                 -- snippet = string.sub(snippet,lstart+1,lstop+1) -- we can return the initial position in the lpegmatch
                 -- return back, fore, lstart, llength, snippet, lstart + 1
                    return back, fore, 0, llength, snippet, lstart + 1
                else
                    return 0, 0, 0, size, snippet, 1
                end
            else
                -- still not entirely okay (nested mp)
                local stop   = start + length
                local lstart = start
                local lstop  = stop
                while lstart > 0 do
                    if thestyleat[lstart] == style_main then
                        break
                    else
                        lstart = lstart - 1
                    end
                end
                if lstart < 0 then
                    lstart = 0
                end
                while lstop < size do
                    if thestyleat[lstop] == style_main then
                        break
                    else
                        lstop = lstop + 1
                    end
                end
                if lstop > size then
                    lstop = size
                end
                local back    = start - lstart
                local fore    = lstop - stop
                local llength = lstop - lstart + 1
                local snippet = editor:textrange(lstart,lstop)
                if llength > #snippet then
                    llength = #snippet
                end
                return back, fore, lstart, llength, snippet, 1
            end
        end
        local snippet = editor:textrange(0,size)
        return 0, 0, 0, size, snippet, 1
    end

    local function scite_lex(lexer,text,offset,initial)
        local grammar = lexer.grammar
        if grammar then
            styleoffset = 1
            nesting     = 0
            startstyling(editor,offset,32)
            local preamble = lexer.preamble
            if preamble then
                lpegmatch(preamble,offset == 0 and text or editor:textrange(0,500))
            end
            lpegmatch(grammar,text,initial)
        end
    end

    -- We can assume sane definitions, that is: most languages use similar constructs for the
    -- start and end of something. So we don't need to waste much time on nested lexers.

    local newline           = patterns.newline

    local scite_fold_base   = SC_FOLDLEVELBASE       or 0
    local scite_fold_header = SC_FOLDLEVELHEADERFLAG or 0
    local scite_fold_white  = SC_FOLDLEVELWHITEFLAG  or 0
    local scite_fold_number = SC_FOLDLEVELNUMBERMASK or 0

    local function styletonumbers(folding,hash)
        if not hash then
            hash = { }
        end
        if folding then
            for k, v in next, folding do
                local s = hash[k] or { }
                for k, v in next, v do
                    local n = numbers[k]
                    if n then
                        s[n] = v
                    end
                end
                hash[k] = s
            end
        end
        return hash
    end
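
    -- A lexer definition maps snippets onto per-style fold level deltas, along the
    -- lines of (an illustration; the style names must exist in the theme):
    --
    --   folding = {
    --       ["{"] = { operator =  1 },
    --       ["}"] = { operator = -1 },
    --   }
    --
    -- and styletonumbers turns the style names into the style numbers that we get
    -- back from thestyleat[] while folding.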

    local folders = setmetatable({ }, { __index = function(t, lexer)
        local folder
        local folding = lexer.folding
        if folding then
            local foldmapping = styletonumbers(folding)
            local embedded    = lexer.embedded
            if embedded then
                for i=1,#embedded do
                    local embed = embedded[i]
                    local lexer = embed.lexer
                    if lexer then
                        foldmapping = styletonumbers(lexer.folding,foldmapping)
                    end
                end
            end
            local foldpattern = helpers.utfchartabletopattern(foldmapping)
            local resetparser = lexer.resetparser
            local line        = 0
            local current     = scite_fold_base
            local previous    = scite_fold_base
            --
            foldpattern = Cp() * (foldpattern/foldmapping) / function(s,match)
                if match then
                    local l = match[thestyleat[s + foldoffset - 1]]
                    if l then
                        current = current + l
                    end
                end
            end
            local action_yes = function()
                if current > previous then
                    previous = previous | scite_fold_header
                elseif current < scite_fold_base then
                    current = scite_fold_base
                end
                thelevelat[line] = previous
                previous = current
                line = line + 1
            end
            local action_nop = function()
                previous = previous | scite_fold_white
                thelevelat[line] = previous
                previous = current
                line = line + 1
            end
            --
            foldpattern = ((foldpattern + (1-newline))^1 * newline/action_yes + newline/action_nop)^0
            --
            folder = function(text,offset,initial)
                if resetparser then
                    resetparser()
                end
                foldoffset = offset
                nesting    = 0
                --
                previous   = scite_fold_base -- & scite_fold_number
                if foldoffset == 0 then
                    line = 0
                else
                    line = getlineat(editor,offset) & scite_fold_number -- scite is at the beginning of a line
                 -- previous = getlevelat(editor,line) -- alas
                    previous = thelevelat[line] -- zero/one
                end
                current = previous
                lpegmatch(foldpattern,text,initial)
            end
        else
            folder = function() end
        end
        t[lexer] = folder
        return folder
    end } )

    -- can somehow be called twice (idem for the lexer)

    local function scite_fold(lexer,text,offset,initial)
        if text ~= "" then
            return folders[lexer](text,offset,initial)
        end
    end

    -- We cannot use the styler style setters so we use the editor ones. This has to do with the fact
    -- that the styler sees the (utf) encoding while we are doing bytes. There is also some initial
    -- skipping over characters. First versions used those callers and had to offset by -2, but while
    -- that works with whole document lexing it doesn't work with partial lexing (one can also get
    -- multiple OnStyle calls per edit).
    --
    -- The backtracking here relates to the fact that we start at the outer lexer (otherwise embedded
    -- lexers can have occasional side effects). It also makes it possible to do better syntax checking
    -- on the fly (some day).
    --
    -- The (old) editor:textrange cannot handle nul characters. If that doesn't get patched in scite we
    -- need to use the styler variant (which is not in scite).

    -- lexer    : context lexer
    -- editor   : scite editor object (needs checking every update)
    -- language : scite lexer language id
    -- filename : current file
    -- size     : size of current file
    -- start    : first position where to edit
    -- length   : length of the stripe to edit
    -- trace    : flag that signals tracing

    -- After quite some experiments with the styler methods I settled on the editor methods because
    -- these are not sensitive to utf and have no side effects like the two forward cursor positions.

    function lexers.scite_onstyle(lexer,editor,partial,language,filename,size,start,length,trace)
        seteditor(editor)
        local clock   = trace and os.clock()
        local back, fore, lstart, llength, snippet, initial = scite_range(lexer,size,start,length,partial)
        if clock then
            report("lexing %s", language)
            report("  document file : %s", filename)
            report("  document size : %i", size)
            report("  styler start  : %i", start)
            report("  styler length : %i", length)
            report("  backtracking  : %i", back)
            report("  foretracking  : %i", fore)
            report("  lexer start   : %i", lstart)
            report("  lexer length  : %i", llength)
            report("  text length   : %i", #snippet)
            report("  lexing method : %s", partial and "partial" or "whole")
            report("  after copying : %0.3f seconds",os.clock()-clock)
        end
        scite_lex(lexer,snippet,lstart,initial)
        if clock then
            report("  after lexing  : %0.3f seconds",os.clock()-clock)
        end
        scite_fold(lexer,snippet,lstart,initial)
        if clock then
            report("  after folding : %0.3f seconds",os.clock()-clock)
        end
    end

end

-- end of scite editor lexer

lexers.context = lexers -- for now

return lexers