Just Created a Tree-sitter Parser: Now What?

The authors of Tree-sitter have provided detailed documentation on creating your own language parsers. So, following the tutorial, you have meticulously crafted your own grammar using tree-sitter-cli. You ran tree-sitter generate as the docs instructed, and it generated the C code required to parse your language. You wrote a simple code snippet and tested parsing it with tree-sitter parse path/to/your/code. The result matches your expectations perfectly. Everything is under control.

Now it's time for the next step. You can't wait to integrate the parser into your project written in Python/C++/Java/... Tree-sitter provides bindings for these languages, making it easy to work with, and a list of known parsers is readily available for you to explore and use. But what about custom parsers? You go back to the documentation, and it looks like the story ends when the parser is created. Wait, what?

Actually, you're not alone in this. An issue on Tree-sitter's GitHub repository raised a similar question, but I think they were overcomplicating things. Let's keep it simple and see what the CLI has generated for us.

It looks like the bindings folder contains bindings for our parser in different languages. However, when we look closer at the subfolders, these bindings are just empty frameworks without the actual parser. In the root directory of the parser project, we also find some files like setup.py. Since I'm working with Python, I opened the setup.py script and found these lines in the setup section:

setup(
    packages=find_packages("bindings/python"),
    package_dir={"": "bindings/python"},
    package_data={
        "tree_sitter_imp": ["*.pyi", "py.typed"],
        "tree_sitter_imp.queries": ["*.scm"],
    },
    ext_package="tree_sitter_imp",
    ext_modules=[
        Extension(
            name="_binding",
            sources=[
                "bindings/python/tree_sitter_imp/binding.c",
                "src/parser.c",
            ],
            define_macros=[
                ("PY_SSIZE_T_CLEAN", None),
                ("TREE_SITTER_HIDE_SYMBOLS", None),
            ],
            include_dirs=["src"],
            py_limited_api=not get_config_var("Py_GIL_DISABLED"),
        )
    ],
    cmdclass={
        "build": Build,
        "build_ext": BuildExt,
        "bdist_wheel": BdistWheel,
        "egg_info": EggInfo,
    },
    zip_safe=False
)

Isn't this exactly what we need to combine the parser with the Python language binding? As in the setup script of any package with native code, the build_ext command compiles the C/C++ extension modules of the package. It's a standard part of the setuptools build process and is often used with the --inplace option, which places the compiled module next to the Python sources. So, I set up a Python virtual environment and ran python setup.py build_ext --inplace. And there we have it! A dynamic library file named _binding.abi3.so has been built and copied into the Python language binding folder.
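To confirm what the build produced, a small check along these lines can help. It is only a sketch: the helper name find_built_extensions is hypothetical, and it matches file names without importing anything. It relies on importlib.machinery.EXTENSION_SUFFIXES, the list of file suffixes the running interpreter accepts for extension modules, which is where a name like _binding.abi3.so comes from.

```python
from importlib.machinery import EXTENSION_SUFFIXES
from pathlib import Path


def find_built_extensions(package_dir, stem="_binding"):
    """Return files in package_dir that this interpreter could load as `stem`.

    Hypothetical helper: it compares file names against the interpreter's
    accepted extension-module suffixes and does not import anything.
    """
    package_dir = Path(package_dir)
    return sorted(
        path
        for path in package_dir.iterdir()
        if path.is_file()
        and any(path.name == stem + suffix for suffix in EXTENSION_SUFFIXES)
    )
```

Run from the parser project root, find_built_extensions("bindings/python/tree_sitter_imp") should list the freshly built file; an empty list suggests the build step did not place anything there.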

Now, let's test the functionality of the package. In the bindings/python/tests folder, there's a simple test script called test_binding.py. We install all the dependencies and run it, and BAM! It works. From now on, my tree_sitter_imp package folder can be copied to my project and I just need to build a parser like this:

from tree_sitter import Language, Parser
import tree_sitter_imp

my_parser = Parser(Language(tree_sitter_imp.language()))
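Once the parser exists, the usual next step is to parse some bytes and walk the resulting tree. The sketch below assumes the py-tree-sitter API, where Parser.parse takes bytes and returns a tree with a root_node, and where each node exposes type and children. The helper name format_tree is hypothetical, and it touches only those two attributes.

```python
def format_tree(node, depth=0):
    """Render a syntax tree as indented lines, one node type per line.

    Works on any object with `type` and `children` attributes, which is
    what py-tree-sitter nodes are assumed to provide here.
    """
    lines = ["  " * depth + node.type]
    for child in node.children:
        lines.extend(format_tree(child, depth + 1))
    return lines


# Intended use with the parser built above (the source text is up to you):
#   tree = my_parser.parse(source_bytes)
#   print("\n".join(format_tree(tree.root_node)))
```

The recursion is fine for small inputs; a very deeply nested tree would call for an iterative walk instead.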

And that's it. Have fun with your custom parser ;-)

…Yet Another Hack

In fact, my initial solution for this issue was veeeeeery different.

Remember that the generate command creates the C code for the parser? This means we can always build the parser as a library, and Tree-sitter provides a tool for this. Running tree-sitter build produces a dynamic library in the root folder of the parser project. I'm working on macOS, so it's named imp.dylib. On Linux or Windows, the name will end with .so or .dll respectively.

To build a Parser object, you need to pass a Language object to the initializer, and for custom language parsers, the initializer for Language requires the pointer returned by the language function, such as tree_sitter_my_lang(). We only need to load the dynamic library into our project and get a handle to that function. How do we deal with dynamic libraries? Different answers for different languages. In Python, we use the ctypes package. And this is how I create a wrapper for the library:

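import ctypes
from tree_sitter import Language, Parser

LANG_NAME = 'imp'
LIB_PATH = 'lib/imp.dylib'  # use the .so or .dll file on Linux or Windows

def imp_parser() -> Parser:
    # Load the shared library produced by `tree-sitter build`.
    lib = ctypes.CDLL(LIB_PATH)
    # The exported entry point is named tree_sitter_<language name>.
    lang_func = getattr(lib, f'tree_sitter_{LANG_NAME}')
    # The function returns a pointer; ctypes would otherwise assume a C int.
    lang_func.restype = ctypes.c_void_p
    return Parser(Language(lang_func()))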
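
The restype line matters more than it looks. By default ctypes treats a foreign function's return value as a C int, so on a 64-bit platform the upper half of a returned pointer would be dropped. The sketch below imitates that with a made-up address (no library is loaded) and assumes 64-bit pointers.

```python
import ctypes

# A made-up 64-bit address, used purely for illustration.
fake_address = 0x7FFF12345678

# What a default (C int) return type would keep: only the low 32 bits.
as_int = ctypes.c_int(fake_address).value

# What restype = ctypes.c_void_p keeps: the full pointer-sized value.
as_pointer = ctypes.c_void_p(fake_address).value

print(hex(as_int))
print(hex(as_pointer))
```

A truncated value like the first one is not a valid pointer, which is why the return type has to be declared before the function is called.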

You just need to put the library under the lib subfolder and import the wrapper; calling imp_parser() then gives you a parser for the language directly. Again, have fun! But I suggest using the first approach. Much more elegant, isn't it? 🤪
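
Since the library name depends on the operating system, the hard-coded LIB_PATH could be derived instead. This is a minimal sketch, assuming the library sits in lib/ and is named after the language as in my setup; the helper names library_suffix and library_path are hypothetical.

```python
import sys
from pathlib import Path


def library_suffix(platform=sys.platform):
    """Map a sys.platform value to the shared-library suffix used there."""
    if platform == "darwin":
        return ".dylib"
    if platform == "win32":
        return ".dll"
    return ".so"  # Linux and other Unix-like systems


def library_path(lang_name, lib_dir="lib", platform=sys.platform):
    """Build a path such as lib/imp.dylib for the given language name."""
    return Path(lib_dir) / (lang_name + library_suffix(platform))
```

With this in place, ctypes.CDLL(str(library_path('imp'))) could stand in for the constant, so the same wrapper works on each platform without editing.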
