local info = {
    version   = 1.400,
    comment   = "basics for scintilla lpeg lexer for context/metafun, contains copyrighted code from mitchell.att.foicica.com",
    author    = "Hans Hagen, PRAGMA-ADE, Hasselt NL",
    copyright = "PRAGMA ADE / ConTeXt Development Team",
    license   = "see context related readme files",
}

-- There is some history behind these lexers. When LPEG came around, we immediately adopted that in CONTEXT
-- and one of the first things to show up were the verbatim plugins. There we have several models: line based
-- and syntax based. The way we visualize the syntax for TEX, METAPOST and LUA relates closely to the way the
-- CONTEXT user interface evolved. We have LPEG all over the place.
--
-- When at some point it became possible to have an LPEG lexer in SCITE (by using the TEXTADEPT dll) I figured
-- out a mix of what we had and what is needed there. The lexers that came with the dll were quite slow so in
-- order to deal with the large LUA data files I rewrote the lexing so that it did work with the dll but was
-- useable otherwise too. There are quite some comments in the older files that explain these steps. However, it
-- never became pretty and didn't always look the way I wanted (read: more in tune with how we use LUA in
-- CONTEXT). Over time the plugin evolved and the code was adapted (to some extent it became more like what we
-- already had) but when SCITE moved to version 5 (as part of a C++ update) and the dll again changed it became
-- clear that we had to come up with a different approach. Not only did the dll have to be kept in sync, but we
-- also had to keep adapting interfaces. When SCITE changed to a new lexer framework some of the properties setup
-- changed but after adapting that it still failed to load. I noticed some new directory scanning in the dll code
-- which probably interferes with the way we load. (I probably need to look into that but adapting the directory
-- structure and adding some cheats is not what I like to do.)
--
-- The original plan was to have TEXTADEPT as fallback but at the pace it was evolving it was not something we
-- could use yet. Because it was meant to be configurable we even had a stripped down interface defined, tuned
-- for CONTEXT related document processing. After all it is good to have a fallback in case SCITE fails. But
-- keeping up with the changing interfaces made clear that it was not really meant for this (replacing components
-- is hard and I assume it's more about adding stuff to the shipped editor), and more and more features are not
-- what we need: editors quickly become overloaded with confusing features that make no sense when editing
-- documents. We need something that is easy to use for novice (and occasional) users and SCITE always has been
-- perfect for that. The nice thing about TEXTADEPT is that it supports more platforms, the nice thing about
-- SCITE is that it is stable and small. I understand that the interplay between scintilla, lexilla and lexlpeg
-- is subtle but because of that using it generically (other than in textadept) is hard.
--
-- So, the question was: how to proceed. The main component missing in SCITE's LUA interface is LPEG. By adding
-- that, plus a few bytewise styler helpers, I was able to use the lexers without the dll. The advantage of using
-- the built in methods is that we (1) can use the same LUA instance that other scripts use, (2) have access to
-- all kind of properties, (3) can have a cleaner implementation (for loading), (4) can make the code look better.
-- In retrospect I should have done that long ago. In the end it turned out that the new implementation is just as
-- fast but also more memory efficient (the dll could occasionally crash on many open files and loading many files
-- when restarting was pretty slow too, probably because of excessive immediate lexing).
--
-- It will take a while to strip out all the artifacts needed for the dll based lexer but we'll get there. Because
-- we also supported the regular lexers that came with the dll some keys got the names needed there but that no
-- longer makes sense: we can use the built-in SCITE lexers for those. One of the things that is gone is the
-- whitespace trickery: we always lex the whole document, as we already did most of the time (the only possible
-- gain is when one is at the end of a document and then we observed side effects of not enough backtracking).
--
-- I will keep the old files archived so we can always use the (optimized) helpers from those if we ever need
-- them. I could go back to the code we had before the dll came around but it makes no sense, so for now I just
-- pruned and rewrote. The lexer definitions are still such that we could load other lexers but that compatibility
-- has now been dropped so I might clean up that bit too. It's not that hard to write additional lexers if I need
-- them.
--
-- We assume at least LUA 5.3 now (tests with LUA 5.4 demonstrated a 10% performance gain). I will also make a
-- helper module that has all the nice CONTEXT functions available. Logging to file is gone because in SCITE we
-- can write to the output pane. Actually: I'm still waiting for scite to overload that output pain lexer.
--
-- As mentioned, the dll based lexer uses whitespace to determine where to start and then only lexes what comes
-- after it. In the mixed lexing that we use that hardly makes sense, because editing before the end still needs
-- to backtrack. The question then becomes if we really save runtime. Also, we can be nested inside nested
-- lexers, which never worked well before, but we can do that now. We also use one theme, so there is no need to
-- be more clever. We no longer keep the styles in a lexer simply because we use a consistent set and have plenty
-- of styles in SCITE now.
--
-- The previous versions had way more code because we also could load the lexers shipped with the dll and had
-- quite some optimizations and caching for older dll's and SCITE limitations, so the real tricks are in these
-- old files.
--
-- We now can avoid the intermediate tables in SCITE and only use them when we lex in CONTEXT. So in the end
-- we're back where we started more than a decade ago. It's a pity that we dropped TEXTADEPT support but it was
-- simply too hard to keep up. So be it. Maybe some day ... after all we still have the old code.
--
-- We had the lexers namespace plus additional tables and functions in the lexers.context namespace in order not
-- to overload 'original' functionality but the context subtable could go away.
--
-- Performance: I decided to go for whole document lexing every time which is fast enough for what we want. If a
-- file is very (!) large one can always choose the "none" lexer in the interface. The advantage of whole parsing
-- is that it is more robust than wild guessing on whitespace (which can fail occasionally), that we are less
-- likely to crash after being in the editor for a whole day, and that preamble scanning etc is now more reliable.
-- If needed I can figure out some gain (but a new and faster machine makes more sense). There is optional partial
-- document lexing (under testing). In any case, the former delay caused by slow loading of many documents at
-- startup is gone now (somehow it looked like all tabs were lexed when a document was opened).

local global = _G

local lpeg  = require("lpeg")

if lpeg.setmaxstack then lpeg.setmaxstack(1000) end

local gmatch, match, lower, upper, gsub, format = string.gmatch, string.match, string.lower, string.upper, string.gsub, string.format
local concat, sort = table.concat, table.sort
local type, next, setmetatable, tostring = type, next, setmetatable, tostring
local R, P, S, C, Cp, Ct, Cmt, Cc, Cf, Cg, Cs = lpeg.R, lpeg.P, lpeg.S, lpeg.C, lpeg.Cp, lpeg.Ct, lpeg.Cmt, lpeg.Cc, lpeg.Cf, lpeg.Cg, lpeg.Cs
local lpegmatch = lpeg.match

local usage    = resolvers and "context" or "scite"
local trace    = false
local collapse = false -- can save some 15% (maybe easier on scintilla)

local lexers     = { }
local styles     = { }
local numbers    = { }
local helpers    = { }
local patterns   = { }
local usedlexers = { }

lexers.usage     = usage

lexers.helpers   = helpers
lexers.styles    = styles
lexers.numbers   = numbers
lexers.patterns  = patterns
-- Maybe at some point I will just load the basic mtx toolkit which gives a lot of benefits but for now we
-- do with poor man's copies.
--
-- Some basic reporting.

local report = logs and logs.reporter("scite lpeg lexer") or function(fmt,str,...)
    if str then
        fmt = format(fmt,str,...)
    end
    print(format("scite lpeg lexer > %s",fmt))
end

report("loading context lexer module")

lexers.report = report

local function sortedkeys(hash) -- simple version, good enough for here
    local t, n = { }, 0
    for k, v in next, hash do
        t[#t+1] = k
        local l = #tostring(k)
        if l > n then
            n = l
        end
    end
    sort(t)
    return t, n
end

helpers.sortedkeys = sortedkeys

-- begin of patterns (we should take them from l-lpeg.lua)

do

    local anything             = P(1)
    local idtoken              = R("az","AZ","\127\255","__")
    local digit                = R("09")
    local sign                 = S("+-")
    local period               = P(".")
    local octdigit             = R("07")
    local hexdigit             = R("09","AF","af")
    local lower                = R("az")
    local upper                = R("AZ")
    local alpha                = upper + lower
    local space                = S(" \n\r\t\f\v")
    local eol                  = S("\r\n")
    local backslash            = P("\\")
    local decimal              = digit^1
    local octal                = P("0")
                               * octdigit^1
    local hexadecimal          = P("0") * S("xX")
                               * (hexdigit^0 * period * hexdigit^1 + hexdigit^1 * period * hexdigit^0 + hexdigit^1)
                               * (S("pP") * sign^-1 * hexdigit^1)^-1 -- *
    local integer              = sign^-1
                               * (hexadecimal + octal + decimal)
    local float                = sign^-1
                               * (digit^0 * period * digit^1 + digit^1 * period * digit^0 + digit^1)
                               * S("eE") * sign^-1 * digit^1 -- *

    patterns.idtoken           = idtoken
    patterns.digit             = digit
    patterns.sign              = sign
    patterns.period            = period
    patterns.octdigit          = octdigit
    patterns.hexdigit          = hexdigit
    patterns.ascii             = R("\000\127") -- useless
    patterns.extend            = R("\000\255") -- useless
    patterns.control           = R("\000\031")
    patterns.lower             = lower
    patterns.upper             = upper
    patterns.alpha             = alpha
    patterns.decimal           = decimal
    patterns.octal             = octal
    patterns.hexadecimal       = hexadecimal
    patterns.float             = float
    patterns.cardinal          = decimal

    local utf8next             = R("\128\191")

    patterns.utf8next          = utf8next
    patterns.utf8one           = R("\000\127")
    patterns.utf8two           = R("\194\223") * utf8next
    patterns.utf8three         = R("\224\239") * utf8next * utf8next
    patterns.utf8four          = R("\240\244") * utf8next * utf8next * utf8next

    patterns.signeddecimal     = sign^-1 * decimal
    patterns.signedoctal       = sign^-1 * octal
    patterns.signedhexadecimal = sign^-1 * hexadecimal
    patterns.integer           = integer
    patterns.real              =
        sign^-1 * (                    -- at most one
            digit^1 * period * digit^0 -- 10.0 10.
          + digit^0 * period * digit^1 -- 0.10 .10
          + digit^1                    -- 10
       )

    patterns.anything          = anything
    patterns.any               = anything
    patterns.restofline        = (1-eol)^1
    patterns.space             = space
    patterns.spacing           = space^1
    patterns.nospacing         = (1-space)^1
    patterns.eol               = eol
    patterns.newline           = P("\r\n") + eol
    patterns.backslash         = backslash

    local endof                = S("\n\r\f")

    patterns.startofline       = P(function(input,index)
        return (index == 1 or lpegmatch(endof,input,index-1)) and index
    end)

end
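
-- A quick impression of what these patterns match; a sketch for testing, not
-- something the lexer itself uses:
--
-- print(lpegmatch(patterns.integer, "0x12EF"))   -- 7 (one past the match)
-- print(lpegmatch(patterns.float,   "1.5e-3"))   -- 7
-- print(lpegmatch(patterns.real,    "10."))      -- 4
-- print(lpegmatch(patterns.utf8two, "\194\169")) -- 3 (a two byte sequence)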

do

    local char     = string.char
    local byte     = string.byte
    local format   = format

    local function utfchar(n)
        if n < 0x80 then
            return char(n)
        elseif n < 0x800 then
            return char(
                0xC0 + (n//0x00040),
                0x80 +  n           % 0x40
            )
        elseif n < 0x10000 then
            return char(
                0xE0 + (n//0x01000),
                0x80 + (n//0x00040) % 0x40,
                0x80 +  n           % 0x40
            )
        elseif n < 0x110000 then
            return char(
                0xF0 + (n//0x40000),
                0x80 + (n//0x01000) % 0x40,
                0x80 + (n//0x00040) % 0x40,
                0x80 +  n           % 0x40
            )
        else
            return "?"
        end
    end

    helpers.utfchar = utfchar
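
    -- For instance (a sketch): utfchar(0x41) gives "A", utfchar(0xA9) gives the
    -- two byte sequence "\194\169" and utfchar(0x2028) the three byte sequence
    -- "\226\128\168".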

    local utf8next         = R("\128\191")
    local utf8one          = R("\000\127")
    local utf8two          = R("\194\223") * utf8next
    local utf8three        = R("\224\239") * utf8next * utf8next
    local utf8four         = R("\240\244") * utf8next * utf8next * utf8next

    helpers.utf8one   = utf8one
    helpers.utf8two   = utf8two
    helpers.utf8three = utf8three
    helpers.utf8four  = utf8four

    local utfidentifier    = utf8two + utf8three + utf8four
    helpers.utfidentifier  = (R("AZ","az","__")      + utfidentifier)
                           * (R("AZ","az","__","09") + utfidentifier)^0

    helpers.utfcharpattern = P(1) * utf8next^0 -- unchecked but fast
    helpers.utfbytepattern = utf8one   / byte
                           + utf8two   / function(s) local c1, c2         = byte(s,1,2) return   c1 * 64 + c2                       -    12416 end
                           + utf8three / function(s) local c1, c2, c3     = byte(s,1,3) return  (c1 * 64 + c2) * 64 + c3            -   925824 end
                           + utf8four  / function(s) local c1, c2, c3, c4 = byte(s,1,4) return ((c1 * 64 + c2) * 64 + c3) * 64 + c4 - 63447168 end
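
    -- The constants above fold out the utf lead and continuation markers, for
    -- instance in the two byte case (c1-0xC0)*0x40 + (c2-0x80) equals
    -- c1*64 + c2 - 12416; likewise 925824 = (0xE0*64+0x80)*64+0x80 and
    -- 63447168 = ((0xF0*64+0x80)*64+0x80)*64+0x80.
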
    helpers.charpattern    = usage == "scite" and 1 or helpers.utfcharpattern

    local p_false          = P(false)
    local p_true           = P(true)

    local function make(t)
        local function making(t)
            local p    = p_false
            local keys = sortedkeys(t)
            for i=1,#keys do
                local k = keys[i]
                if k ~= "" then
                    local v = t[k]
                    if v == true then
                        p = p + P(k) * p_true
                    elseif v == false then
                        -- can't happen
                    else
                        p = p + P(k) * making(v)
                    end
                end
            end
            if t[""] then
                p = p + p_true
            end
            return p
        end
        local p    = p_false
        local keys = sortedkeys(t)
        for i=1,#keys do
            local k = keys[i]
            if k ~= "" then
                local v = t[k]
                if v == true then
                    p = p + P(k) * p_true
                elseif v == false then
                    -- can't happen
                else
                    p = p + P(k) * making(v)
                end
            end
        end
        return p
    end

    local function collapse(t,x)
        if type(t) ~= "table" then
            return t, x
        else
            local n = next(t)
            if n == nil then
                return t, x
            elseif next(t,n) == nil then
                -- one entry
                local k = n
                local v = t[k]
                if type(v) == "table" then
                    return collapse(v,x..k)
                else
                    return v, x .. k
                end
            else
                local tt = { }
                for k, v in next, t do
                    local vv, kk = collapse(v,k)
                    tt[kk] = vv
                end
                return tt, x
            end
        end
    end

    function helpers.utfchartabletopattern(list)
        local tree = { }
        local n = #list
        if n == 0 then
            for s in next, list do
                local t = tree
                local p, pk
                for c in gmatch(s,".") do
                    if t == true then
                        t = { [c] = true, [""] = true }
                        p[pk] = t
                        p = t
                        t = false
                    elseif t == false then
                        t = { [c] = false }
                        p[pk] = t
                        p = t
                        t = false
                    else
                        local tc = t[c]
                        if not tc then
                            tc = false
                            t[c] = false
                        end
                        p = t
                        t = tc
                    end
                    pk = c
                end
                if t == false then
                    p[pk] = true
                elseif t == true then
                    -- okay
                else
                    t[""] = true
                end
            end
        else
            for i=1,n do
                local s = list[i]
                local t = tree
                local p, pk
                for c in gmatch(s,".") do
                    if t == true then
                        t = { [c] = true, [""] = true }
                        p[pk] = t
                        p = t
                        t = false
                    elseif t == false then
                        t = { [c] = false }
                        p[pk] = t
                        p = t
                        t = false
                    else
                        local tc = t[c]
                        if not tc then
                            tc = false
                            t[c] = false
                        end
                        p = t
                        t = tc
                    end
                    pk = c
                end
                if t == false then
                    p[pk] = true
                elseif t == true then
                    -- okay
                else
                    t[""] = true
                end
            end
        end
        collapse(tree,"")
        return make(tree)
    end
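
    -- The resulting pattern matches any string from the given list (or any key
    -- when a hash is passed), trying longer continuations before a shorter
    -- terminating match; a sketch:
    --
    -- local p = helpers.utfchartabletopattern { "aa", "ab", "abc" }
    -- print(lpegmatch(p,"abcd")) -- 4, "abc" wins over "ab"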

    patterns.invisibles = helpers.utfchartabletopattern {
        utfchar(0x00A0), -- nbsp
        utfchar(0x2000), -- enquad
        utfchar(0x2001), -- emquad
        utfchar(0x2002), -- enspace
        utfchar(0x2003), -- emspace
        utfchar(0x2004), -- threeperemspace
        utfchar(0x2005), -- fourperemspace
        utfchar(0x2006), -- sixperemspace
        utfchar(0x2007), -- figurespace
        utfchar(0x2008), -- punctuationspace
        utfchar(0x2009), -- breakablethinspace
        utfchar(0x200A), -- hairspace
        utfchar(0x200B), -- zerowidthspace
        utfchar(0x202F), -- narrownobreakspace
        utfchar(0x205F), -- math thinspace
        utfchar(0x200C), -- zwnj
        utfchar(0x200D), -- zwj
    }

    -- now we can make:

    patterns.wordtoken    = R("az","AZ","\127\255")
    patterns.wordpattern  = patterns.wordtoken^3 -- todo: if limit and #s < limit then

    patterns.iwordtoken   = patterns.wordtoken - patterns.invisibles
    patterns.iwordpattern = patterns.iwordtoken^3

end

-- end of patterns

-- begin of scite properties

-- Because we use a limited number of lexers we can provide a new whitespace on demand. If needed
-- we can recycle from a pool or we can just not reuse a lexer and load anew. I'll deal with that
-- when the need is there. At that moment I might as well start working with nested tables (so that
-- we have a language tree).

local whitespace = function() return "whitespace" end

local maxstyle    = 127 -- otherwise negative values in editor object -- 255
local nesting     = 0
local style_main  = 0
local style_white = 0

if usage == "scite" then

    local names = { }
    local props = { }
    local count = 1

    -- styles 32 .. 39 are reserved; we want to avoid holes so we preset:

    for i=0,maxstyle do
        numbers[i] = "default"
    end

    whitespace = function()
        return style_main -- "mainspace"
    end

    function lexers.loadtheme(theme)
        styles = theme or { }
        for k, v in next, styles do
            names[#names+1] = k
        end
        sort(names)
        for i=1,#names do
            local name = names[i]
            styles[name].n = count
            numbers[name] = count
            numbers[count] = name
            if count == 31 then
                count = 40
            else
                count = count + 1
            end
        end
        for i=1,#names do
            local t = { }
            local s = styles[names[i]]
            local n = s.n
            local fore = s.fore
            local back = s.back
            local font = s.font
            local size = s.size
            local bold = s.bold
            if fore then
                if #fore == 1 then
                    t[#t+1] = format("fore:#%02X%02X%02X",fore[1],fore[1],fore[1])
                elseif #fore == 3 then
                    t[#t+1] = format("fore:#%02X%02X%02X",fore[1],fore[2],fore[3])
                end
            end
            if back then
                if #back == 1 then
                    t[#t+1] = format("back:#%02X%02X%02X",back[1],back[1],back[1])
                elseif #back == 3 then
                    t[#t+1] = format("back:#%02X%02X%02X",back[1],back[2],back[3])
                else
                    t[#t+1] = "back:#000000"
                end
            end
            if bold then
                t[#t+1] = "bold"
            end
            if font then
                t[#t+1] = format("font:%s",font)
            end
            if size then
                t[#t+1] = format("size:%s",size)
            end
            if #t > 0 then
                props[n] = concat(t,",")
            end
        end
        setmetatable(styles, {
            __index =
                function(target,name)
                    if name then
                        count = count + 1
                        if count > maxstyle then
                            count = maxstyle
                        end
                        numbers[name] = count
                        local style = { n = count }
                        target[name] = style
                        return style
                    end
                end
        } )
        lexers.styles  = styles
        lexers.numbers = numbers

        style_main  = styles.mainspace.n
        style_white = styles.whitespace.n
    end

    function lexers.registertheme(properties,name)
        for n, p in next, props do
            local tag = "style.script_" .. name .. "." .. n
            properties[tag] = p
        end
    end
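
    -- A theme is just a hash of style definitions with the fields consumed
    -- above; the concrete styles here are a sketch:
    --
    -- lexers.loadtheme {
    --     mainspace  = { fore = { 0x00 } },
    --     whitespace = { back = { 0xFF } },
    --     keyword    = { fore = { 0x00, 0x00, 0x80 }, bold = true },
    -- }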

end

-- end of scite properties

-- begin of word matchers

do

    -- we can load characters.lower if we can find it

    local pattern = false
    local mapping = { }

    lower = function(str)
        if not pattern and next(mapping) then
            pattern = Cs((helpers.utfchartabletopattern(mapping)/mapping + helpers.utfcharpattern)^1)
        end
        return pattern and lpegmatch(pattern,str) or str
    end

    helpers.lowercasestring = lower

    helpers.registermapping = function(data)
        local l = data.lower
        if l then
            for k, v in next, l do
                mapping[k] = v
            end
        end
        pattern = false
    end

end

do

  -- function patterns.exactmatch(words,case_insensitive)
  --     local characters = concat(words)
  --     local pattern = S(characters) + patterns.idtoken
  --     if case_insensitive then
  --         pattern = pattern + S(upper(characters)) + S(lower(characters))
  --     end
  --     if case_insensitive then
  --         local list = { }
  --         if #words == 0 then
  --             for k, v in next, words do
  --                 list[lower(k)] = v
  --             end
  --         else
  --             for i=1,#words do
  --                 list[lower(words[i])] = true
  --             end
  --         end
  --         return Cmt(pattern^1, function(_,i,s)
  --             return list[lower(s)] -- and i or nil
  --         end)
  --     else
  --         local list = { }
  --         if #words == 0 then
  --             for k, v in next, words do
  --                 list[k] = v
  --             end
  --         else
  --             for i=1,#words do
  --                 list[words[i]] = true
  --             end
  --         end
  --         return Cmt(pattern^1, function(_,i,s)
  --             return list[s] -- and i or nil
  --         end)
  --     end
  -- end
  --
  -- function patterns.justmatch(words)
  --     local p = P(words[1])
  --     for i=2,#words do
  --         p = p + P(words[i])
  --     end
  --     return p
  -- end

    -- we could do camelcase but that is not what users use for keywords

    local p_finish = #(1 - R("az","AZ","__"))

    patterns.finishmatch = p_finish

    function patterns.exactmatch(words,ignorecase)
        local list = { }
        if ignorecase then
            if #words == 0 then
                for k, v in next, words do
                    list[lower(k)] = v
                end
            else
                for i=1,#words do
                    list[lower(words[i])] = true
                end
            end
            -- here we grab a run of identifier characters and check the
            -- lowercased result against the lowercased list
            return Cmt(C(patterns.idtoken^1), function(_,i,s)
                return list[lower(s)] and i or nil
            end)
        else
            if #words == 0 then
                for k, v in next, words do
                    list[k] = v
                end
            else
                for i=1,#words do
                    list[words[i]] = true
                end
            end
        end
        return helpers.utfchartabletopattern(list) * p_finish
    end

    patterns.justmatch = patterns.exactmatch
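
    -- The resulting pattern only matches a keyword when no identifier character
    -- follows; a sketch:
    --
    -- local p_keyword = patterns.exactmatch { "begin", "end" }
    -- print(lpegmatch(p_keyword,"begin "))   -- 6
    -- print(lpegmatch(p_keyword,"beginner")) -- nil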

end

-- end of word matchers

-- begin of loaders

do

    local cache = { }

    function lexers.loadluafile(name)
        local okay, data = pcall(require, name)
        if okay and data then
            if trace then
                report("lua file '%s' has been loaded",name)
            end
            return data, name
        end
        if trace then
            report("unable to load lua file '%s'",name)
        end
    end

    function lexers.loaddefinitions(name)
        local data = cache[name]
        if data then
            if trace then
                report("reusing definitions '%s'",name)
            end
            return data
        elseif data == false then
            if trace then
                report("definitions '%s' were not found",name)
            end
            return false
        end
        local okay, data = pcall(require, name)
        if not okay or not data then
            report("unable to load definition file '%s'",name)
            data = false
        elseif trace then
            report("definition file '%s' has been loaded",name)
        end
        cache[name] = data
        return type(data) == "table" and data
    end

end

-- end of loaders

-- begin of spell checking (todo: pick files from distribution instead)

do

    -- spell checking (we can only load lua files)
    --
    -- return {
    --     min   = 3,
    --     max   = 40,
    --     n     = 12345,
    --     words = {
    --         ["someword"]    = "someword",
    --         ["anotherword"] = "Anotherword",
    --     },
    -- }

    local lists    = { }
    local disabled = false

    function lexers.disablewordcheck()
        disabled = true
    end

    function lexers.setwordlist(tag,limit) -- returns hash (lowercase keys and original values)
        if not tag or tag == "" then
            return false, 3
        end
        local list = lists[tag]
        if not list then
            list = lexers.loaddefinitions("spell-" .. tag)
            if not list or type(list) ~= "table" then
                report("invalid spell checking list for '%s'",tag)
                list = { words = false, min = 3 }
            else
                list.words = list.words or false
                list.min   = list.min or 3
            end
            lists[tag] = list
            helpers.registermapping(list)
        end
        if trace then
            report("enabling spell checking for '%s' with minimum '%s'",tag,list.min)
        end
        return list.words, list.min
    end
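
    -- Typical usage in a (tex, xml, txt) lexer, with the tag coming from the
    -- document or an editor property (a sketch):
    --
    -- local validwords, validminimum = lexers.setwordlist("en")
    --
    -- after which words are checked with lexers.styleofword(validwords,
    -- validminimum,word,position).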

    if usage ~= "scite" then

        function lexers.styleofword(validwords,validminimum,s,p)
            if not validwords or #s < validminimum then
                return "text", p
            else
                -- keys are lower
                local word = validwords[s]
                if word == s then
                    return "okay", p -- exact match
                elseif word then
                    return "warning", p -- case issue
                else
                    local word = validwords[lower(s)]
                    if word == s then
                        return "okay", p -- exact match
                    elseif word then
                        return "warning", p -- case issue
                    elseif upper(s) == s then
                        return "warning", p -- probably a logo or acronym
                    else
                        return "error", p
                    end
                end
            end
        end

    end

end

-- end of spell checking

-- begin lexer management

lexers.structured = false
-- lexers.structured = true -- the future for the typesetting end

do

    function lexers.new(name,filename)
        if not filename then
            filename = false
        end
        local lexer = {
            name       = name,
            filename   = filename,
            whitespace = whitespace()
        }
        if trace then
            report("initializing lexer tagged '%s' from file '%s'",name,filename or name)
        end
        return lexer
    end

    if usage == "scite" then

        -- overloaded later

        function lexers.token(name, pattern)
            local s = styles[name] -- always something anyway
            return pattern * Cc(s and s.n or 32) * Cp()
        end

    else

        function lexers.token(name, pattern)
            return pattern * Cc(name) * Cp()
        end

    end

    -- todo: variant that directly styles

    local function append(pattern,step)
        if not step then
            return pattern
        elseif pattern then
            return pattern + P(step)
        else
            return P(step)
        end
    end

    local function prepend(pattern,step)
        if not step then
            return pattern
        elseif pattern then
            return P(step) + pattern
        else
            return P(step)
        end
    end

    local wrapup = usage == "scite" and
        function(name,pattern)
            return pattern
        end
    or
        function(name,pattern,nested)
            if lexers.structured then
                return Cf ( Ct("") * Cg(Cc("name") * Cc(name)) * Cg(Cc("data") * Ct(pattern)), rawset)
            elseif nested then
                return pattern
            else
                return Ct (pattern)
            end
        end

    local function construct(namespace,lexer,level)
        if lexer then
            local rules    = lexer.rules
            local embedded = lexer.embedded
            local grammar  = nil
            if embedded then
                for i=1,#embedded do
                    local embed = embedded[i]
                    local done  = embed.done
                    if not done then
                        local lexer = embed.lexer
                        local start = embed.start
                        local stop  = embed.stop
                        if usage == "scite" then
                            start = start / function() nesting = nesting + 1 end
                            stop  = stop  / function() nesting = nesting - 1 end
                        end
                        if trace then
                            start = start / function() report("    nested lexer %s: start",lexer.name) end
                            stop  = stop  / function() report("    nested lexer %s: stop", lexer.name) end
                        end
                        done = start * (construct(namespace,lexer,level+1) - stop)^0 * stop
                        done = wrapup(lexer.name,done,true)
                        embed.done = done -- cache the wrapped pattern
                    end
                 -- grammar = prepend(grammar, done)
                    grammar = append(grammar, done)
                end
            end
            if rules then
                for i=1,#rules do
                    grammar = append(grammar,rules[i][2])
                end
            end
            return grammar
        end
    end

    function lexers.load(filename,namespace)
        if not namespace then
            namespace = filename
        end
        local lexer = usedlexers[namespace] -- we load by filename but the internal name can be short
        if lexer then
            if trace then
                report("reusing lexer '%s'",namespace)
            end
            return lexer
        elseif trace then
            report("loading lexer '%s' from '%s'",namespace,filename)
        end
        local lexer, name = lexers.loadluafile(filename)
        if not lexer then
            report("invalid lexer file '%s'",filename)
            return lexers.new(filename)
        elseif type(lexer) ~= "table" then
            if trace then
                report("lexer file '%s' gets a dummy lexer",filename)
            end
            return lexers.new(filename)
        end
        local grammar = construct(namespace,lexer,1)
        if grammar then
            grammar = wrapup(namespace,grammar^0)
            lexer.grammar = grammar
        end
        --
        local backtracker = lexer.backtracker
        local foretracker = lexer.foretracker
        if backtracker then
            local start    = 1
            local position = 1
            local pattern  = (Cmt(Cs(backtracker),function(s,p,m) if p > start then return #s else position = p - #m end end) + P(1))^1
            lexer.backtracker = function(str,offset)
                position = 1
                start    = offset
                lpegmatch(pattern,str,1)
                return position
            end
        end
        if foretracker then
            local start    = 1
            local position = 1
            local pattern  = (Cmt(Cs(foretracker),function(s,p,m) position = p - #m return #s end) + P(1))^1
            lexer.foretracker = function(str,offset)
                position = offset
                start    = offset
                lpegmatch(pattern,str,position)
                return position
            end
        end
        --
        usedlexers[filename] = lexer
        return lexer
    end
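
    -- A lexer can provide backtracker and foretracker patterns that match safe
    -- points for (partial) relexing; the wrappers built above then report the
    -- position of the last match before, respectively the first match after,
    -- the edited range. A hypothetical example:
    --
    -- lexer.backtracker = P("\\starttext") + P("\\startcomponent")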

    function lexers.embed(parent, embed, start, stop, rest)
        local embedded = parent.embedded
        if not embedded then
            embedded        = { }
            parent.embedded = embedded
        end
        embedded[#embedded+1] = {
            lexer = embed,
            start = start,
            stop  = stop,
            rest  = rest,
        }
    end

end
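
-- A minimal lexer built on the above could look like this (hypothetical, it
-- just shows how new, token and rules relate):
--
-- local demolexer  = lexers.new("demo")
-- local whitespace = lexers.token("default", patterns.spacing)
-- local keyword    = lexers.token("keyword", patterns.exactmatch { "begin", "end" })
-- local rest       = lexers.token("text",    patterns.wordpattern + patterns.anything)
-- demolexer.rules  = {
--     { "whitespace", whitespace },
--     { "keyword",    keyword    },
--     { "rest",       rest       },
-- }
-- return demolexer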

-- end lexer management

-- This will become a configurable option (whole is more reliable but it can
-- be slow on those 5 megabyte lua files):

-- begin of context typesetting lexer

if usage ~= "scite" then

    local function collapsed(t)
        local lasttoken = nil
        local lastindex = nil
        for i=1,#t,2 do
            local token    = t[i]
            local position = t[i+1]
            if token == lasttoken then
                t[lastindex] = position
            elseif lastindex then
                lastindex = lastindex + 1
                t[lastindex] = token
                lastindex = lastindex + 1
                t[lastindex] = position
                lasttoken = token
            else
                lastindex = i+1
                lasttoken = token
            end
        end
        if lastindex then
            for i=#t,lastindex+1,-1 do
                t[i] = nil
            end
        end
        return t
    end
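
    -- So { "text", 10, "text", 14, "okay", 20 } collapses into the shorter
    -- { "text", 14, "okay", 20 }.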

    function lexers.lex(lexer,text) -- get rid of init_style
        local grammar = lexer.grammar
        if grammar then
            nesting = 0
            if trace then
                report("lexing '%s' string with length %i",lexer.name,#text)
            end
            local t = lpegmatch(grammar,text)
            if collapse then
                t = collapsed(t)
            end
            return t
        else
            return { }
        end
    end

end

-- end of context typesetting lexer

-- begin of scite editor lexer

if usage == "scite" then

    -- For char-def.lua we need some 0.55 s with Lua 5.3 and 10% less with Lua 5.4 (timed on a 2013
    -- Dell precision with i7-3840QM). That test file has 271540 lines of Lua (table) code and is
    -- 5.312.665 bytes large (dd 2021.09.29). The three methods perform about the same but the more
    -- direct approach saves some tables. Using the new Lua garbage collector makes no difference.
    --
    -- We can actually integrate folding in here if we want but it might become messy as we then
    -- also need to deal with specific newlines. We can also (in scite) store some extra state wrt
    -- the language used.
    --
    -- Operating on a range (as in the past) is faster when editing very large documents but we
    -- don't do that often. The problem is that backtracking over whitespace is tricky for some
    -- nested lexers.

    local editor       = false
    local startstyling = false   -- editor:StartStyling(position,style)
    local setstyling   = false   -- editor:SetStyling(slice,style)
    local getlevelat   = false   -- editor.StyleAt[position] or StyleAt(editor,position)
    local getlineat    = false
    local thestyleat   = false   -- editor.StyleAt[position]
    local thelevelat   = false

    local styleoffset  = 1
    local foldoffset   = 0

    local function seteditor(usededitor)
        editor       = usededitor
        startstyling = editor.StartStyling
        setstyling   = editor.SetStyling
        getlevelat   = editor.FoldLevel        -- GetLevelAt
        getlineat    = editor.LineFromPosition
        thestyleat   = editor.StyleAt
        thelevelat   = editor.FoldLevel        -- SetLevelAt
    end

    function lexers.token(style, pattern)
        if type(style) ~= "number" then
            style = styles[style] -- always something anyway
            style = style and style.n or 32
        end
        return pattern * Cp() / function(p)
            local n = p - styleoffset
            local s = style -- don't overwrite the upvalue, nesting comes and goes
            if nesting > 0 and s == style_main then
                s = style_white
            end
            setstyling(editor,n,s)
            styleoffset = styleoffset + n
        end
    end

    -- used in: tex txt xml

    function lexers.styleofword(validwords,validminimum,s,p)
        local style
        if not validwords or #s < validminimum then
            style = numbers.text
        else
            -- keys are lower
            local word = validwords[s]
            if word == s then
                style = numbers.okay -- exact match
            elseif word then
                style = numbers.warning -- case issue
            else
                local word = validwords[lower(s)]
                if word == s then
                    style = numbers.okay -- exact match
                elseif word then
                    style = numbers.warning -- case issue
                elseif upper(s) == s then
                    style = numbers.warning -- probably a logo or acronym
                else
                    style = numbers.error
                end
            end
        end
        local n = p - styleoffset
        setstyling(editor,n,style)
        styleoffset = styleoffset + n
    end

    -- when we have an embedded language we can not rely on the range that
    -- scite provides because we need to look further

    -- it looks like scite starts before the cursor / insert

    local function scite_range(lexer,size,start,length,partial) -- set editor
        if partial then
            local backtracker = lexer.backtracker
            local foretracker = lexer.foretracker
            if start == 0 and size == length then
                -- see end
            elseif (backtracker or foretracker) and start > 0 then
                local snippet = editor:textrange(0,size)
                if size ~= length then
                    -- only lstart matters, the rest is statistics; we operate on 1-based strings
                    local lstart = backtracker and backtracker(snippet,start+1) or 0
                    local lstop  = foretracker and foretracker(snippet,start+1+length) or size
                    if lstart > 0 then
                        lstart = lstart - 1
                    end
                    if lstop > size then
                        lstop = size - 1
                    end
                    local stop    = start + length
                    local back    = start - lstart
                    local fore    = lstop - stop
                    local llength = lstop - lstart + 1
                 -- snippet = string.sub(snippet,lstart+1,lstop+1) -- we can return the initial position in the lpegmatch
                 -- return back, fore, lstart, llength, snippet, lstart + 1
                    return back, fore, 0, llength, snippet, lstart + 1
                else
                    return 0, 0, 0, size, snippet, 1
                end
            else
                -- still not entirely okay (nested mp)
                local stop   = start + length
                local lstart = start
                local lstop  = stop
                while lstart > 0 do
                    if thestyleat[lstart] == style_main then
                        break
                    else
                        lstart = lstart - 1
                    end
                end
                if lstart < 0 then
                    lstart = 0
                end
                while lstop < size do
                    if thestyleat[lstop] == style_main then
                        break
                    else
                        lstop = lstop + 1
                    end
                end
                if lstop > size then
                    lstop = size
                end
                local back    = start - lstart
                local fore    = lstop - stop
                local llength = lstop - lstart + 1
                local snippet = editor:textrange(lstart,lstop)
                if llength > #snippet then
                    llength = #snippet
                end
                return back, fore, lstart, llength, snippet, 1
            end
        end
        local snippet = editor:textrange(0,size)
        return 0, 0, 0, size, snippet, 1
    end

    local function scite_lex(lexer,text,offset,initial)
        local grammar = lexer.grammar
        if grammar then
            styleoffset = 1
            nesting     = 0
            startstyling(editor,offset,32)
            local preamble = lexer.preamble
            if preamble then
                lpegmatch(preamble,offset == 0 and text or editor:textrange(0,500))
            end
            lpegmatch(grammar,text,initial)
        end
    end

    -- We can assume sane definitions, that is: most languages use similar constructs for the start
    -- and end of something. So we don't need to waste much time on nested lexers.

    local newline           = patterns.newline

    local scite_fold_base   = SC_FOLDLEVELBASE       or 0
    local scite_fold_header = SC_FOLDLEVELHEADERFLAG or 0
    local scite_fold_white  = SC_FOLDLEVELWHITEFLAG  or 0
    local scite_fold_number = SC_FOLDLEVELNUMBERMASK or 0

    local function styletonumbers(folding,hash)
        if not hash then
            hash = { }
        end
        if folding then
            for k, v in next, folding do
                local s = hash[k] or { }
                for k, v in next, v do
                    local n = numbers[k]
                    if n then
                        s[n] = v
                    end
                end
                hash[k] = s
            end
        end
        return hash
    end
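
    -- A lexer describes folding as a mapping from matched strings to per style
    -- level increments, for instance (a sketch, the operator style and brace
    -- tokens being an assumption):
    --
    -- lexer.folding = {
    --     ["{"] = { operator =  1 },
    --     ["}"] = { operator = -1 },
    -- }
    --
    -- which styletonumbers then rekeys from style names to style numbers.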

    local folders = setmetatable({ }, { __index = function(t, lexer)
        local folder  = false
        local folding = lexer.folding
        if folding then
            local foldmapping = styletonumbers(folding)
            local embedded    = lexer.embedded
            if embedded then
                for i=1,#embedded do
                    local embed = embedded[i]
                    local lexer = embed.lexer
                    if lexer then
                        foldmapping = styletonumbers(lexer.folding,foldmapping)
                    end
                end
            end
            local foldpattern = helpers.utfchartabletopattern(foldmapping)
            local resetparser = lexer.resetparser
            local line        = 0
            local current     = scite_fold_base
            local previous    = scite_fold_base
            --
            foldpattern = Cp() * (foldpattern/foldmapping) / function(s,match)
                if match then
                    local l = match[thestyleat[s + foldoffset - 1]]
                    if l then
                        current = current + l
                    end
                end
            end
            local action_yes = function()
                if current > previous then
                    previous = previous | scite_fold_header
                elseif current < scite_fold_base then
                    current = scite_fold_base
                end
                thelevelat[line] = previous
                previous = current
                line = line + 1
            end
            local action_nop = function()
                previous = previous | scite_fold_white
                thelevelat[line] = previous
                previous = current
                line = line + 1
            end
            --
            foldpattern = ((foldpattern + (1-newline))^1 * newline/action_yes + newline/action_nop)^0
            --
            folder = function(text,offset,initial)
                if resetparser then
                    resetparser()
                end
                foldoffset = offset
                nesting    = 0
                --
                previous   = scite_fold_base -- & scite_fold_number
                if foldoffset == 0 then
                    line = 0
                else
                    line = getlineat(editor,offset) & scite_fold_number -- scite is at the beginning of a line
                 -- previous = getlevelat(editor,line) -- alas
                    previous = thelevelat[line] -- zero/one
                end
                current = previous
                lpegmatch(foldpattern,text,initial)
            end
        else
            folder = function() end
        end
        t[lexer] = folder
        return folder
    end } )

    -- can somehow be called twice (idem for the lexer)

    local function scite_fold(lexer,text,offset,initial)
        if text ~= "" then
            return folders[lexer](text,offset,initial)
        end
    end

    -- We cannot use the styler style setters so we use the editor ones. This has to do with the fact
    -- that the styler sees the (utf) encoding while we are doing bytes. There is also some initial
    -- skipping over characters. First versions used those callers and had to offset by -2, but while
    -- that works with whole document lexing it doesn't work with partial lexing (one can also get
    -- multiple OnStyle calls per edit).
    --
    -- The backtracking here relates to the fact that we start at the outer lexer (otherwise embedded
    -- lexers can have occasional side effects). It also makes it possible to do better syntax checking
    -- on the fly (some day).
    --
    -- The (old) editor:textrange cannot handle nul characters. If that doesn't get patched in scite we
    -- need to use the styler variant (which is not in scite).

    -- lexer    : context lexer
    -- editor   : scite editor object (needs checking every update)
    -- language : scite lexer language id
    -- filename : current file
    -- size     : size of current file
    -- start    : first position where to edit
    -- length   : length of the stripe to edit
    -- trace    : flag that signals tracing

    -- After quite some experiments with the styler methods I settled on the editor methods because
    -- these are not sensitive for utf and have no side effects like the two forward cursor positions.

    function lexers.scite_onstyle(lexer,editor,partial,language,filename,size,start,length,trace)
        seteditor(editor)
        local clock   = trace and os.clock()
        local back, fore, lstart, llength, snippet, initial = scite_range(lexer,size,start,length,partial)
        if clock then
            report("lexing %s", language)
            report("  document file : %s", filename)
            report("  document size : %i", size)
            report("  styler start  : %i", start)
            report("  styler length : %i", length)
            report("  backtracking  : %i", back)
            report("  foretracking  : %i", fore)
            report("  lexer start   : %i", lstart)
            report("  lexer length  : %i", llength)
            report("  text length   : %i", #snippet)
            report("  lexing method : %s", partial and "partial" or "whole")
            report("  after copying : %0.3f seconds",os.clock()-clock)
        end
        scite_lex(lexer,snippet,lstart,initial)
        if clock then
            report("  after lexing  : %0.3f seconds",os.clock()-clock)
        end
        scite_fold(lexer,snippet,lstart,initial)
        if clock then
            report("  after folding : %0.3f seconds",os.clock()-clock)
        end
    end

end

-- end of scite editor lexer

lexers.context = lexers -- for now

return lexers