<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[#dev with Brandon]]></title><description><![CDATA[#dev with Brandon]]></description><link>https://blog.brandonw3612.com</link><generator>RSS for Node</generator><lastBuildDate>Wed, 13 May 2026 12:36:45 GMT</lastBuildDate><atom:link href="https://blog.brandonw3612.com/rss.xml" rel="self" type="application/rss+xml"/><language><![CDATA[en]]></language><ttl>60</ttl><item><title><![CDATA[Controlled Branching Processes: A Case Study in Structured Sampling]]></title><description><![CDATA[Background & Problem Statement
Earlier this week, we were developing a random AST sampler for a toy language, which is a small subset of a folklore language, IMP. It has a really simple syntax:
syntax AExp ::= Int | Id
              | AExp "/" AExp
 ...]]></description><link>https://blog.brandonw3612.com/controlled-branching-processes-a-case-study-in-structured-sampling</link><guid isPermaLink="true">https://blog.brandonw3612.com/controlled-branching-processes-a-case-study-in-structured-sampling</guid><category><![CDATA[Branching Process]]></category><category><![CDATA[Probability Distributions]]></category><category><![CDATA[ast]]></category><dc:creator><![CDATA[Brandon Wong]]></dc:creator><pubDate>Mon, 17 Nov 2025 23:18:25 GMT</pubDate><content:encoded><![CDATA[<h2 id="heading-background-amp-problem-statement">Background &amp; Problem Statement</h2>
<p>Earlier this week, we were developing a random AST sampler for a toy language, which is a small subset of a folklore language, <em>IMP</em>. It has a really simple syntax:</p>
<pre><code class="lang-plaintext">syntax AExp ::= Int | Id
              | AExp "/" AExp
              &gt; AExp "+" AExp
              | "(" AExp ")"
syntax Program ::= AExp
</code></pre>
<p>Simply put, a program consists of a single arithmetic expression, denoted <code>AExp</code>, and an <code>AExp</code> node in the tiny AST can be an integer <code>Int</code>, an identifier <code>Id</code>, a division expression, an addition expression, or a bracketed expression. The recursive definitions in the operator rules enable the ASTs to expand in width and depth.</p>
<p>In our implementation, we create a concrete node type and a corresponding mask node type for each production rule in the syntax. This representation allows us to hide (partial) semantic information from a real-world or synthesized program and thus plays an important role in the forward/reverse process of a graph edit neural network.</p>
<p>To synthesize random samples for the language, we employ a top-down generation strategy, where we always start with a program whose body is a single <code>[AExpMask]</code>. Next, we perform two types of actions on the nodes to grow the AST into a semantically complete program without any masks:</p>
<ol>
<li><p><strong>Mask-Down</strong>. As we have described above, the node types are defined as a hierarchical structure, where the abstract type <code>AExp</code> is inherited by other expression types. Hence, an <code>[AExpMask]</code> node can be replaced with an <code>[IntMask]</code>, an <code>[IdMask]</code>, a <code>[DivExpMask]</code>, an <code>[AddExpMask]</code>, or a <code>[BrAExpMask]</code>.</p>
</li>
<li><p><strong>Unmask</strong>. A mask for a concrete expression type can be replaced with an instance of the concrete expression node. For terminal types in the language definition, <em>i.e.</em>, <code>Int</code> and <code>Id</code>, we replace a mask with a node carrying a randomly sampled numeric value or identifier name. For non-terminal types, we create an empty node of the concrete type, with each operand, <em>i.e.</em>, each nested expression, set to an <code>[AExpMask]</code>.</p>
</li>
</ol>
<p>We apply the actions to the masks in the AST until none remain. In fact, the two actions can also be viewed as a single one that iteratively converts an <code>[AExpMask]</code> into a concrete expression node and applies the production rules to grow the tree. Of the two action types, it is easy to see that <strong>Mask-Down</strong> controls the depth and width of the tree. To keep the sampled programs at a reasonable scale, the sampler performs this action based on a pre-defined set of weights.</p>
<p>To simplify the problem, we only <strong>Mask-Down</strong> an <code>[AExpMask]</code> to an <code>[IntMask]</code>, an <code>[IdMask]</code>, a <code>[DivExpMask]</code> or an <code>[AddExpMask]</code>, where the former two are terminal nodes in an AST and have no offspring, and the latter two are non-terminal nodes with exactly two offspring. After generating an AST, we review its structure and add brackets to certain subtrees if necessary. Now we wonder: how should we set the weights for the <strong>Mask-Down</strong> action to obtain a <em>nice</em> AST? In other words, we would like to figure out how the distribution of concrete nodes affects the scale of the resulting ASTs.</p>
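<p>To make the generation procedure concrete, here is a minimal sketch of the simplified sampler, assuming a single weight \(p\) for choosing a terminal over a binary operator. The tuple-based node representation and helper names below are illustrative, not our actual implementation:</p>
<pre><code class="lang-python">import random

def sample_aexp(p):
    """Mask-Down an [AExpMask]: pick a terminal with probability p,
    otherwise a binary operator whose operands start as fresh masks."""
    if random.random() &lt; p:
        # Unmask a terminal: a random integer literal or identifier name.
        if random.random() &lt; 0.5:
            return ('Int', random.randint(0, 99))
        return ('Id', random.choice(['x', 'y', 'z']))
    # Unmask a non-terminal: both operands are [AExpMask]s,
    # which we expand recursively.
    op = random.choice(['DivExp', 'AddExp'])
    return (op, sample_aexp(p), sample_aexp(p))

program = sample_aexp(p=0.6)  # as we will see, p &gt;= 0.5 keeps the tree finite
</code></pre>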
<p>With the above simplification, we can formalize the problem as:</p>
<blockquote>
<p>A binary tree contains two types of nodes: NT and T, where each NT node has exactly two offspring and each T node is a leaf node. Now, given a distribution over the occurrence of NT and T, how can we compute the probability distribution of the resulting trees’ depths?</p>
</blockquote>
<h2 id="heading-from-the-branching-process-perspective">From the Branching Process Perspective</h2>
<p>The above problem is actually a <a target="_blank" href="https://en.wikipedia.org/wiki/Galton%E2%80%93Watson_process">Galton-Watson branching process</a>, which can be expressed as follows:</p>
<ol>
<li>The process starts with a single ancestor in the initial generation (the root node of the binary tree):</li>
</ol>
<p>$$Z_0=1$$</p><ol start="2">
<li>The number of offspring for any given individual, denoted by the random variable \(X\), follows the specific probability mass function \(p_k=P(X=k)\):</li>
</ol>
<p>$$\begin{cases} p_0 = P(X=0) = p, \\ p_2 = P(X=2) = 1-p, \\ p_k = P(X=k) = 0, \text{for} \ k \in \{1, 3, 4, \dots\} \end{cases}$$</p><p>The number of descendants are independent and identically distributed (i.i.d.) for all individuals in all generations.</p>
<ol start="3">
<li>The size of the \((n+1)\)-th generation, \(Z_{n+1}\), is determined by the sum of the offspring produces by all individuals in the \(n\)-th generation, \(Z_n\). Formally, if we let \(X_{n,i}\) be the number of offspring produced by the \(i\)-th individual in generation \(Z_n\), then we have</li>
</ol>
<p>$$Z_{n+1}=\sum_{i=1}^{Z_n}X_{n+1,i},$$</p><p>where \(\{X_{n+1,i}\}\) is a sequence of i.i.d. random variables with the common offspring distribution \(P(X=k)\) defined above, and they are independent of \(Z_0, Z_1, \dots, Z_n\).</p>
<ol start="4">
<li>The mean number of offspring per individual, \(\mu\), is a crucial parameter for determining the long-term behavior of the process, where we have</li>
</ol>
<p>$$\mu=E[X]=\sum_{k=0}^\infty{k \cdot p_k} = 0 \cdot p + 2 \cdot (1 - p) = 2 (1 - p).$$</p><p>This process is a <strong>discrete-time Markov chain</strong> on the state space of non-negative integers \(\mathbb{Z}_{\geq 0}\), with the state \(Z_n=0\) being an absorbing state (extinction).</p>
<h2 id="heading-extinction-time-aka-depth-of-trees">Extinction Time aka Depth of Trees</h2>
<p>Now we try to compute the extinction time of the branching process, <em>i.e.</em>, the depth of the binary tree. The extinction time of the Galton-Watson process is defined as the first generation \(N\) at which the population size drops to zero. Formally, we have that</p>
<p>$$N=\inf\{n \mid Z_n = 0\},$$</p><p>where \(\inf\emptyset = \infty\). If the population never becomes extinct, <em>i.e.</em>, for all \(n\) we have \(Z_n&gt;0\), then \(N=\infty\).</p>
<p>The overall probability of ultimate extinction, \(\pi\), is the probability that the extinction time is finite:</p>
<p>$$\pi = P(N&lt;\infty) = P(\exists n \geq 1 \ . Z_n = 0).$$</p><p>For the specific branching process under our settings, the value of \(\pi\) is the smallest non-negative root of the equation</p>
<p>$$s=G(s),$$</p><p>where \(G(s)\) is the <strong>probability generating function</strong> (PGF) of the offspring distribution. More specifically, our PGF for the offspring is</p>
<p>$$G(s)=E[s^X]=\sum_{k=0}^\infty{s^kp_k}=s^0 \cdot p + s^2 \cdot (1 - p) = p + (1-p)s^2.$$</p><p>The extinction probability \(\pi\) is the smallest non-negative solution to</p>
<p>$$\pi = p + (1-p) \pi^2.$$</p><p>In general, calculating \(P(N=k)\) requires iterating the PGF. Let \(G_k(s)\) be the PGF for \(Z_k\), and we have</p>
<p>$$G_k(s)=E[s^{Z_k}]=G(G_{k-1}(s)),$$</p><p>for each \(k&gt;1\). And for \(k=1\), we have that \(Z_0=1\). Hence, \(G_0(s)=E[s^1]=s\).</p>
<p>With the definition of the PGFs for each generation, we have the probability of extinction <strong>by</strong> generation \(k\):</p>
<p>$$P(N \leq k) = P(Z_k=0) = G_k(0).$$</p><p>And the probability of extinction at generation \(k\) should be:</p>
<p>$$P(N=k)=P(Z_k=0) - P(Z_{k-1}=0) = G_k(0) - G_{k-1}(0).$$</p><p>We calculated each \(P(N=k)\) for each \(k \in [1, 40]\) under different distributions of \(X\), where \(p\) is set to one of the values in \(\{0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9\}\). The figure below shows the distribution of \(P(N=k)\) and that of \(P(N \leq k)\), as well as the ultimate distinction probability \(\pi\).</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1763350677953/ab009f64-6f7e-46fa-95ff-4098079afe0c.png" alt class="image--center mx-auto" /></p>
<p>From the figure, we can see that when \(p&lt;0.5\), <em>i.e.</em>, \(\mu=2(1-p)&gt;1\), extinction may never occur. This corresponds to the supercritical case of a branching process. Hence, to always sample an AST with a finite depth, we should require that \(p \geq 0.5\).</p>
<p>The probability mass function \(P(N=k)\) graph shows that for all considered values of \(p\), the distribution is heavily concentrated towards the left, resulting in a rapidly decaying curve. Specifically, a very high proportion of the extinction events occur within the first few generations. This rapid decay means that the extinction time \(N\) is typically too short, <em>i.e.</em>, the sampled ASTs are too shallow. An ideal distribution would spread the probability mass more evenly over a moderate range of generations to achieve an intermediate extinction time, but the current process’s inherent tendency is to extinguish almost immediately or survive indefinitely. In summary, the distribution of \(N\) is too steep, and we need a process that results in a slower decay of \(P(N=k)\), shifting the probability mass to higher generation numbers.</p>
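<p>For reference, here is a minimal sketch of the computation of \(P(N=k)\) above, using the recursion \(q_k = G(q_{k-1})\) with \(q_0 = G_0(0) = 0\); the function and variable names are illustrative:</p>
<pre><code class="lang-python">def extinction_time_pmf(p, max_k=40):
    """P(N = k) for k = 1..max_k, under the offspring distribution
    P(X=0) = p, P(X=2) = 1 - p."""
    q_prev = 0.0  # q_0 = G_0(0) = 0, since Z_0 = 1
    pmf = []
    for _ in range(max_k):
        q = p + (1 - p) * q_prev ** 2  # q_k = G(q_{k-1})
        pmf.append(q - q_prev)         # P(N = k) = q_k - q_{k-1}
        q_prev = q
    return pmf
</code></pre>
<p>Summing the returned list gives \(P(N \leq 40)\), which approaches the ultimate extinction probability \(\pi\) as the horizon grows.</p>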
<h2 id="heading-going-controlled-a-non-homogeneous-alternative">Going Controlled: A Non-Homogeneous Alternative</h2>
<p>Okay. So how should we control the branching process? Ideally, we want the tree (population) to expand in width in the first few generations, and as it grows, we hope that the probability of expansion decreases. To achieve this desired behavior, we introduce a non-homogeneous Galton-Watson branching process. Instead of having a fixed offspring distribution \(P(X=k)\) for all generations, the probability of an individual leaving zero offspring changes over time. In our binary branching model, where an individual either produces 0 or 2 offspring, we define the probability of leaving zero offspring, \(p_k=P(X_k=0)\), as a smooth function of the generation number \(k\). We propose the following functional form for this probability:</p>
<p>$$p_k=\tanh(t \cdot k),$$</p><p>where \(k\) is the current generation index and \(t\) is a positive constant that acts as a tuning parameter that controls the rate at which \(p_k\) increases. This leads to the offspring probability PGF for generation \(k\) turning into</p>
<p>$$G_k(s)=p_k + (1 - p_k)s^2.$$</p><p>And since now we are dealing with a non-homogeneous process, the iterative production of the PGF for the population size \(Z_n\) at the \(n\)-th generation is no longer a simple repetitive composition of the same PGF for each generation. Instead, we have the iterative relation between \(G^{(n)}(s)=E[s^{Z_n}]\) and \(G^{(n-1)}(s)=E[s^{Z_{n-1}}]\):</p>
<p>$$\begin{align} G^{(n)}(s) &amp;= E[s^{Z_n}] = E[E[s^{Z_n}\mid Z_{n-1}]] = E[E[s^{\sum_{i=1}^{Z_{n-1}}{X_{n,i}}}]] \\ &amp;= E[(E[s^{X_n}])^{Z_{n-1}}] = E[(G_n(s))^{Z_{n-1}}] = G^{(n-1)}(G_n(s)). \end{align}$$</p><p>Similarly to the homogeneous branching process, we also have \(G^{(0)}(s)=s\), since \(Z_0=1\). And for each \(k\geq1\), we have that</p>
<p>$$G^{(n)}(s) = G_1(G_2(G_3(\dots G_n(s)\dots))).$$</p><p>With the definition of \(G^{(n)}(s)\), we have the probability of extinction <strong>by</strong> generation \(k\)</p>
<p>$$P(N \leq k) = P(Z_k = 0) = G^{(n)}(0),$$</p><p>and the probability of extinction <strong>at</strong> generation \(k\)</p>
<p>$$P(N=k)=P(Z_k=0)-P(Z_{k-1}=0) = G^{(n)}(0) - G^{(n-1)}(0).$$</p><p>With the constant parameter \(t\) selected from \(\{0.1, 0.2, 0.3, 0.4, 0.5, 1.0\}\), we compute the probability distribution of the extinction time \(N\). See the following figure.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1763419054877/e21b0439-2504-4955-b603-fd440aafe73a.png" alt class="image--center mx-auto" /></p>
<p>From the left graph, it can be seen that the distribution is more concentrated around a moderate number of generations when \(t&lt;0.5\). From the right graph, we observe that no matter what value is assigned to \(t\), the ultimate extinction probability is always \(1\). Hence, using \(\tanh\) in the offspring distribution is an appropriate choice for the binary branching process in our case.</p>
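<p>The corresponding computation for the non-homogeneous case is only slightly more involved: since the innermost PGF now changes with \(k\), \(G^{(k)}(0)\) is recomputed from the inside out for each \(k\). A minimal sketch, again with illustrative names:</p>
<pre><code class="lang-python">import math

def nh_extinction_cdf(t, max_k=40):
    """P(N &lt;= k) = G^{(k)}(0) for k = 1..max_k, with p_j = tanh(t * j)."""
    cdf = []
    for k in range(1, max_k + 1):
        s = 0.0
        for j in range(k, 0, -1):          # compose G_1(G_2(...G_k(0)...))
            p_j = math.tanh(t * j)
            s = p_j + (1 - p_j) * s ** 2   # apply G_j, innermost first
        cdf.append(s)
    return cdf

def nh_extinction_pmf(t, max_k=40):
    cdf = [0.0] + nh_extinction_cdf(t, max_k)  # prepend P(N &lt;= 0) = 0
    return [cdf[k] - cdf[k - 1] for k in range(1, max_k + 1)]
</code></pre>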
<h2 id="heading-summary">Summary</h2>
<p>In this blog, the problem of sampling random ASTs for a custom language was modeled as a branching process to control the resulting tree depth. The AST generation uses two actions: <strong>Mask-Down</strong> (growing the tree depth/width, corresponding to producing 2 offspring, <em>NT</em>) and <strong>Unmask</strong> (terminating a branch, corresponding to 0 offspring, <em>T</em>). The initial attempt used a <strong>homogeneous Galton-Watson branching process</strong> where the probability of terminating a branch, \(p\), was constant across all generations. The offspring distribution was \(P(X=0)=p\) and \(P(X=2)=1-p\). The analysis showed that to guarantee a finite-depth AST, the mean number of offspring \(\mu = 2(1-p)\) must be \(\leq 1\), requiring \(p \geq 0.5\). However, this homogeneous process resulted in a tree depth probability distribution \(P(N=k)\) that decayed too rapidly, meaning most sampled ASTs were very shallow (small depth \(N\)).</p>
<p>To achieve a more desirable, moderate AST depth, a <strong>non-homogeneous Galton-Watson branching process</strong> was introduced. In this model, the termination probability is no longer constant but changes with the generation index \(k\): \(p_k = P(X_k=0) = \tanh(t \cdot k)\), where \(t\) is a tuning parameter. This function ensures that in early generations (small \(k\)), \(p_k\) is low, encouraging the tree to grow wide, and as the tree gets deeper (large \(k\)), \(p_k\) increases, making termination more likely. The new process utilizes the iterative PGF composition \(G^{(n)}(s) = G^{(n-1)}(G_n(s))\) to calculate \(P(N=k) = G^{(k)}(0) - G^{(k-1)}(0)\).</p>
<p>The results confirmed that the non-homogeneous process, particularly for small values of \(t\), <em>e.g.</em>, \(t&lt;0.5\), spreads the probability mass over a larger range of depths, thus sampling ASTs with more moderate depths. Furthermore, the \(\tanh\) function guarantees that the ultimate extinction probability is \(1\), meaning all generated ASTs will eventually have a finite depth. Via this approach, we eventually arrive at a controlled branching process.</p>
]]></content:encoded></item><item><title><![CDATA[Just Created a Tree-sitter Parser: Now What?]]></title><description><![CDATA[The authors of Tree-sitter have provided detailed documentation on creating your own language parsers. So, following the tutorial, you have meticulously crafted your own grammar using tree-sitter-cli. You ran tree-sitter generate as the docs instruct...]]></description><link>https://blog.brandonw3612.com/just-created-a-tree-sitter-parser-now-what</link><guid isPermaLink="true">https://blog.brandonw3612.com/just-created-a-tree-sitter-parser-now-what</guid><category><![CDATA[tree-sitter]]></category><category><![CDATA[programming languages]]></category><category><![CDATA[compiler]]></category><category><![CDATA[parser]]></category><category><![CDATA[Python]]></category><dc:creator><![CDATA[Brandon Wong]]></dc:creator><pubDate>Fri, 24 Oct 2025 20:34:03 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1761330488121/b85939fc-4f98-437b-a419-18196b9d8ba1.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>The authors of <a target="_blank" href="https://tree-sitter.github.io/tree-sitter/">Tree-sitter</a> have provided detailed <a target="_blank" href="https://tree-sitter.github.io/tree-sitter/creating-parsers/index.html">documentation</a> on creating your own language parsers. So, following the tutorial, you have meticulously crafted your own grammar using <code>tree-sitter-cli</code>. You ran <code>tree-sitter generate</code> as the docs instructed and this generated the C code required to parse your language. You wrote a simple code snippet and tested parsing it with <code>tree-sitter parse path/to/your/code</code>. The result matches perfectly with your expectations. Everything is going under control.</p>
<p>Now it's time for the next step. You can't wait to integrate the parser into your project written in Python/C++/Java/... Tree-sitter provides bindings for these languages, making it easy to work with, and a list of known parsers is readily available for you to explore and use. But what about custom parsers? You go back to the documentation, and it looks like the story ends when the parser is created. Wait, what?</p>
<p>Actually, you're not alone in this. An <a target="_blank" href="https://github.com/tree-sitter/tree-sitter/issues/643">issue</a> on Tree-sitter's GitHub repository raised a similar question, but I think they were overcomplicating things. Let's keep it simple and see what the CLI has generated for us:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1763350059565/5007666d-f356-42f6-9053-d7235e40bce9.png" alt /></p>
<p>It looks like the <code>bindings</code> folder contains bindings for our parser in different languages. However, when we look closer at the subfolders, these bindings are just empty frameworks without the actual parser. In the root directory of the parser project, we also find some files like <code>setup.py</code>. Since I'm working with Python, I opened the <code>setup.py</code> script and found these lines in the <code>setup</code> section:</p>
<pre><code class="lang-python">setup(
    packages=find_packages(<span class="hljs-string">"bindings/python"</span>),
    package_dir={<span class="hljs-string">""</span>: <span class="hljs-string">"bindings/python"</span>},
    package_data={
        <span class="hljs-string">"tree_sitter_imp"</span>: [<span class="hljs-string">"*.pyi"</span>, <span class="hljs-string">"py.typed"</span>],
        <span class="hljs-string">"tree_sitter_imp.queries"</span>: [<span class="hljs-string">"*.scm"</span>],
    },
    ext_package=<span class="hljs-string">"tree_sitter_imp"</span>,
    ext_modules=[
        Extension(
            name=<span class="hljs-string">"_binding"</span>,
            sources=[
                <span class="hljs-string">"bindings/python/tree_sitter_imp/binding.c"</span>,
                <span class="hljs-string">"src/parser.c"</span>,
            ],
            define_macros=[
                (<span class="hljs-string">"PY_SSIZE_T_CLEAN"</span>, <span class="hljs-literal">None</span>),
                (<span class="hljs-string">"TREE_SITTER_HIDE_SYMBOLS"</span>, <span class="hljs-literal">None</span>),
            ],
            include_dirs=[<span class="hljs-string">"src"</span>],
            py_limited_api=<span class="hljs-keyword">not</span> get_config_var(<span class="hljs-string">"Py_GIL_DISABLED"</span>),
        )
    ],
    cmdclass={
        <span class="hljs-string">"build"</span>: Build,
        <span class="hljs-string">"build_ext"</span>: BuildExt,
        <span class="hljs-string">"bdist_wheel"</span>: BdistWheel,
        <span class="hljs-string">"egg_info"</span>: EggInfo,
    },
    zip_safe=<span class="hljs-literal">False</span>
)
</code></pre>
<p>Isn't this exactly what we need to combine the parser with the Python language binding? Just like in the setup scripts of general packages, the <code>build_ext</code> command is used to compile C/C++ extension modules for the package. It's an important part of the setuptools build process and is often used with the <code>--inplace</code> option. So, I set up the Python virtual environment and ran <code>python setup.py build_ext --inplace</code>. And there we have it! A dynamic library file named <code>_binding.abi3.so</code> has been built and copied into the Python language binding folder.</p>
<p>Now, let's test the functionality of the package. In the <code>bindings/python/tests</code> folder, there's a simple test script called <code>test_binding.py</code>. We install all the dependencies and run it, and BAM! It works. From now on, my <code>tree_sitter_imp</code> package folder can be copied to my project and I just need to build a parser like this:</p>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> tree_sitter <span class="hljs-keyword">import</span> Language, Parser
<span class="hljs-keyword">import</span> tree_sitter_imp

my_parser = Parser(Language(tree_sitter_imp.language()))
</code></pre>
<p>And, that's it. Have fun with your custom parser ;-)</p>
<h2 id="heading-yet-another-hack">…Yet Another Hack</h2>
<p>In fact, my initial solution for this issue was veeeeeery different.</p>
<p>Remember the <code>generate</code> command creates the C code for the parser, right? This means we can always build the parser as a library. Tree-sitter provides <a target="_blank" href="https://tree-sitter.github.io/tree-sitter/cli/build.html">a tool for this</a>. Using the command <code>tree-sitter build</code>, a dynamic library is built in the root folder of the parser project. I'm working on macOS, so it's named <code>imp.dylib</code>. If you're on Linux or Windows, the name will end with <code>.so</code> or <code>.dll</code>.</p>
<p>To build a <code>Parser</code> object, you need to pass a <code>Language</code> object to the initializer, and for custom language parsers, the initializer for <code>Language</code> requires a pointer returned by the language function, like <code>tree_sitter_my_lang()</code>. We only need to load the dynamic library into our project and get a handle to the function. How do we deal with dynamic libraries? Different answers for different languages. In Python, we use the <code>ctypes</code> package. And this is how I create a wrapper for the library:</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> ctypes
<span class="hljs-keyword">from</span> tree_sitter <span class="hljs-keyword">import</span> Language, Parser

LANG_NAME = <span class="hljs-string">'imp'</span>
LIB_PATH = <span class="hljs-string">'lib/imp.dylib'</span>

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">imp_parser</span>() -&gt; Parser:</span>
    lib = ctypes.CDLL(LIB_PATH)
    lang_func = getattr(lib, <span class="hljs-string">'tree_sitter_imp'</span>)
    lang_func.restype = ctypes.c_void_p
    <span class="hljs-keyword">return</span> Parser(Language(lang_func()))
</code></pre>
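<p>As a quick sanity check, here is a hedged usage sketch; the sample input is arbitrary, and note that <code>Parser.parse</code> expects the source code as <code>bytes</code>:</p>
<pre><code class="lang-python">parser = imp_parser()
tree = parser.parse(b"x + 1 / y")  # Parser.parse takes the source as bytes
print(tree.root_node)              # inspect the root node of the parse tree
</code></pre>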
<p>You just need to put the library under the <code>lib</code> subfolder and import the wrapper; then calling <code>imp_parser()</code> directly gets you a parser for the language. Again, have fun! But I suggest using the first approach. Much more elegant, isn't it? 🤪</p>
]]></content:encoded></item></channel></rss>