@AverageHelper
Contributor

@AverageHelper AverageHelper commented Jul 9, 2025

Closes #560

Version 0.24.1 of the "Gemini hypertext format" is specified at https://geminiprotocol.net/docs/gemtext-specification.gmi.

This is my first time writing a lexer here 😅 I first built it in the XML format, but couldn't work out how to make it also highlight syntax inside preformatted blocks, so I moved it to Go, using markdown.go as a reference.

My original gemtext.xml, which passes the same tests, except that preformatted blocks are always treated as plain text:
<lexer>
  <config>
    <name>Gemtext</name>
    <alias>gemtext</alias>
    <alias>gmi</alias>
    <alias>gmni</alias>
    <alias>gemini</alias>
    <filename>*.gmi</filename>
    <filename>*.gmni</filename>
    <filename>*.gemini</filename>
    <mime_type>text/gemini</mime_type>
  </config>
  <rules>
    <state name="root">
      <rule pattern="^(#[^#].+\r?\n)">
        <token type="GenericHeading" />
      </rule>
      <rule pattern="^(#{2,3}.+\r?\n)">
        <token type="GenericSubheading" />
      </rule>
      <rule pattern="^(\* )(.+\r?\n)">
        <bygroups>
          <token type="Keyword" />
          <token type="Text" />
        </bygroups>
      </rule>
      <rule pattern="^(>)(.+\r?\n)">
        <bygroups>
          <token type="Keyword" />
          <token type="GenericEmph" />
        </bygroups>
      </rule>
      <rule pattern="^(```\r?\n)([\w\W]*?)(^```)(.+\r?\n)?">
        <bygroups>
          <token type="LiteralString" />
          <token type="Text" />
          <token type="LiteralString" />
          <token type="Comment" />
        </bygroups>
      </rule>
      <rule pattern="^(```)(.+\r?\n)([\w\W]*?)(^```)(.+\r?\n)?">
        <bygroups>
          <token type="LiteralString" />
          <token type="LiteralString" />
          <token type="Text" />
          <token type="LiteralString" />
          <token type="Comment" />
        </bygroups>
      </rule>
      <rule pattern="^(=>)(\s*)([^\s]+)(\s*)$">
        <bygroups>
          <token type="Keyword" />
          <token type="Text" />
          <token type="NameAttribute" />
          <token type="Text" />
        </bygroups>
      </rule>
      <rule pattern="^(=>)(\s*)([^\s]+)(\s+)(.+)$">
        <bygroups>
          <token type="Keyword" />
          <token type="Text" />
          <token type="NameAttribute" />
          <token type="Text" />
          <token type="NameTag" />
        </bygroups>
      </rule>
      <rule pattern=".|(?:\r?\n)">
        <token type="Text" />
      </rule>
    </state>
  </rules>
</lexer>
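
The XML rules above can only assign fixed token types to a preformatted block's body. To highlight that body, a lexer first has to separate the alt text on the opening fence from the block contents. Here is a minimal sketch of that split using only Go's standard `regexp` package. It is not the code from the merged lexer: `splitPreformatted` is a hypothetical helper, and the pattern is adapted from the second preformatted rule above.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// fencedBlock is adapted from the second preformatted rule in the XML above.
// Group 1 is the alt text after the opening fence; group 2 is the block body.
// (?m) lets ^ match at the start of each line.
var fencedBlock = regexp.MustCompile("(?m)^```(.+)\\r?\\n([\\w\\W]*?)^```")

// splitPreformatted (hypothetical helper) returns the alt text and body of the
// first preformatted block in src. ok is false when no block is found.
func splitPreformatted(src string) (alt, body string, ok bool) {
	m := fencedBlock.FindStringSubmatch(src)
	if m == nil {
		return "", "", false
	}
	// "." also matches "\r", so trim a trailing CR left over from CRLF input.
	return strings.TrimSpace(m[1]), m[2], true
}

func main() {
	src := "```go\nfmt.Println(\"hi\")\n```\n"
	alt, body, ok := splitPreformatted(src)
	fmt.Printf("alt=%q body=%q ok=%v\n", alt, body, ok)
}
```

Once the alt text is known, it can be used to look up a lexer for the body, with plain text as the fallback when no lexer matches.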

The spec mentions that CRLF and LF line endings are essentially interchangeable. I wasn't sure how to test that, since IIRC Git clobbers CR on Unix systems, but the `\r?\n` in these patterns seems right at least.
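
One way to exercise CRLF input without relying on how Git stores fixture files is to write the line endings as explicit `\r\n` escapes in a Go string literal. The sketch below assumes the first heading pattern from the XML above and checks it with the standard `regexp` package rather than through the lexer itself. `matchHeading` is a hypothetical helper for illustration.

```go
package main

import (
	"fmt"
	"regexp"
)

// headingRule mirrors the first heading pattern from the XML lexer above.
// (?m) makes ^ match at the start of each line.
var headingRule = regexp.MustCompile(`(?m)^(#[^#].+\r?\n)`)

// matchHeading (hypothetical helper) returns the heading line matched at the
// very start of input, including its line terminator, or "" if none matches.
func matchHeading(input string) string {
	loc := headingRule.FindStringIndex(input)
	if loc == nil || loc[0] != 0 {
		return ""
	}
	return input[loc[0]:loc[1]]
}

func main() {
	// The escapes keep the CR bytes in the source no matter how Git
	// normalizes line endings in the checked-out file.
	fmt.Printf("%q\n", matchHeading("# Title\nbody\n"))     // "# Title\n"
	fmt.Printf("%q\n", matchHeading("# Title\r\nbody\r\n")) // "# Title\r\n"
}
```

Because `.` in Go's `regexp` matches `\r`, the CR in the CRLF case is consumed by `.+` rather than by `\r?`, but the matched line is the same either way.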

@AverageHelper AverageHelper marked this pull request as ready for review July 9, 2025 23:28
@alecthomas alecthomas merged commit f3be4c6 into alecthomas:master Jul 10, 2025
2 checks passed
@AverageHelper AverageHelper deleted the avg/gemtext branch July 10, 2025 04:42
@alecthomas
Owner

Thanks!
