<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[#dev with Brandon]]></title><description><![CDATA[#dev with Brandon]]></description><link>https://blog.brandonw3612.com</link><generator>RSS for Node</generator><lastBuildDate>Wed, 13 May 2026 12:36:45 GMT</lastBuildDate><atom:link href="https://blog.brandonw3612.com/rss.xml" rel="self" type="application/rss+xml"/><language><![CDATA[en]]></language><ttl>60</ttl><item><title><![CDATA[Controlled Branching Processes: A Case Study in Structured Sampling]]></title><description><![CDATA[Background & Problem Statement
Earlier this week, we were developing a random AST sampler for a toy language, which is a small subset of a folklore language, IMP. It has a really simple syntax:
syntax AExp ::= Int | Id
              | AExp "/" AExp
 ...]]></description><link>https://blog.brandonw3612.com/controlled-branching-processes-a-case-study-in-structured-sampling</link><guid isPermaLink="true">https://blog.brandonw3612.com/controlled-branching-processes-a-case-study-in-structured-sampling</guid><category><![CDATA[Branching Process]]></category><category><![CDATA[Probability Distributions]]></category><category><![CDATA[ast]]></category><dc:creator><![CDATA[Brandon Wong]]></dc:creator><pubDate>Mon, 17 Nov 2025 23:18:25 GMT</pubDate><content:encoded><![CDATA[<h2 id="heading-background-amp-problem-statement">Background &amp; Problem Statement</h2>
<p>Earlier this week, we were developing a random AST sampler for a toy language, which is a small subset of a folklore language, <em>IMP</em>. It has a really simple syntax:</p>
<pre><code class="lang-plaintext">syntax AExp ::= Int | Id
              | AExp "/" AExp
              &gt; AExp "+" AExp
              | "(" AExp ")"
syntax Program ::= AExp
</code></pre>
<p>Simply put, a program consists of a single arithmetic expression, denoted <code>AExp</code>, and an <code>AExp</code> node in the tiny AST can be an integer <code>Int</code>, an identifier <code>Id</code>, a division expression, an addition expression, or a bracketed expression. The recursive definitions in the operator rules enable the ASTs to expand in width and depth.</p>
<p>In our implementation, we create a concrete node type and a corresponding mask node type for each production rule in the syntax. This representation allows us to hide (partial) semantic information from a real-world or synthesized program and thus plays an important role in the forward/reverse process of a graph edit neural network.</p>
<p>To synthesize random samples for the language, we employ a top-down generation strategy, where we always start with a program whose body is a single <code>[AExpMask]</code>. Next, we perform two types of actions on the nodes to grow the AST into a semantically complete program without any masks:</p>
<ol>
<li><p><strong>Mask-Down</strong>. As we have described above, the node types are defined as a hierarchical structure, where the abstract type <code>AExp</code> is inherited by other expression types. Hence, an <code>[AExpMask]</code> node can be replaced with an <code>[IntMask]</code>, an <code>[IdMask]</code>, a <code>[DivExpMask]</code>, an <code>[AddExpMask]</code>, or a <code>[BrAExpMask]</code>.</p>
</li>
<li><p><strong>Unmask</strong>. A mask for a concrete expression type can be replaced with an instance of the concrete expression node. For terminal types in the language definition, <em>i.e.</em>, <code>Int</code> and <code>Id</code>, we replace a mask with a node carrying a randomly sampled numeric value or identifier name. For non-terminal types, we create an empty node of the concrete type, with each operand, <em>i.e.</em>, each nested expression, set to an <code>[AExpMask]</code>.</p>
</li>
</ol>
<p>We apply the actions to the masks in the AST until none remain. In fact, the two actions can also be viewed as a single one that iteratively converts an <code>[AExpMask]</code> into a concrete expression node and applies the production rules to grow the tree. Of the two action types, it is easy to see that <strong>Mask-Down</strong> controls the depth and width of the tree. To keep the sampled programs at a reasonable scale, the sampler performs this action based on a pre-defined set of weights.</p>
<p>To simplify the problem, we only <strong>Mask-Down</strong> an <code>[AExpMask]</code> to an <code>[IntMask]</code>, an <code>[IdMask]</code>, a <code>[DivExpMask]</code> or an <code>[AddExpMask]</code>, where the former two are terminal nodes in an AST and have no offspring, and the latter two are non-terminal nodes with exactly two offspring. After generating an AST, we review its structure and add brackets to certain subtrees if necessary. Now we wonder: how should we set the weights for the <strong>Mask-Down</strong> action to obtain a <em>nice</em> AST? In other words, we would like to figure out how the distribution of concrete nodes affects the scale of the resulting ASTs.</p>
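<p>To make the generation procedure concrete, here is a minimal sketch of the simplified sampler, assuming a single weight \(p\) for choosing a terminal over a binary operator. The tuple-based node representation and helper names below are illustrative, not our actual implementation:</p>
<pre><code class="lang-python">import random

def sample_aexp(p):
    """Mask-Down an [AExpMask]: pick a terminal with probability p,
    otherwise a binary operator whose operands start as fresh masks."""
    if random.random() &lt; p:
        # Unmask a terminal: a random integer literal or identifier name.
        if random.random() &lt; 0.5:
            return ('Int', random.randint(0, 99))
        return ('Id', random.choice(['x', 'y', 'z']))
    # Unmask a non-terminal: both operands are [AExpMask]s,
    # which we expand recursively.
    op = random.choice(['DivExp', 'AddExp'])
    return (op, sample_aexp(p), sample_aexp(p))

program = sample_aexp(p=0.6)  # as we will see, p &gt;= 0.5 keeps the tree finite
</code></pre>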
<p>With the above simplification, we can formalize the problem as:</p>
<blockquote>
<p>A binary tree contains two types of nodes: NT and T, where each NT node has exactly two offspring and each T node is a leaf node. Now, given a distribution over the occurrence of NT and T, how can we compute the probability distribution of the resulting trees’ depths?</p>
</blockquote>
<h2 id="heading-from-the-branching-process-perspective">From the Branching Process Perspective</h2>
<p>The above problem is actually a <a target="_blank" href="https://en.wikipedia.org/wiki/Galton%E2%80%93Watson_process">Galton-Watson branching process</a>, which can be expressed as follows:</p>
<ol>
<li>The process starts with a single ancestor in the initial generation (the root node of the binary tree):</li>
</ol>
<p>$$Z_0=1$$</p><ol start="2">
<li>The number of offspring for any given individual, denoted by the random variable \(X\), follows the specific probability mass function \(p_k=P(X=k)\):</li>
</ol>
<p>$$\begin{cases} p_0 = P(X=0) = p, \\ p_2 = P(X=2) = 1-p, \\ p_k = P(X=k) = 0, \text{for} \ k \in \{1, 3, 4, \dots\} \end{cases}$$</p><p>The number of descendants are independent and identically distributed (i.i.d.) for all individuals in all generations.</p>
<ol start="3">
<li>The size of the \((n+1)\)-th generation, \(Z_{n+1}\), is determined by the sum of the offspring produces by all individuals in the \(n\)-th generation, \(Z_n\). Formally, if we let \(X_{n,i}\) be the number of offspring produced by the \(i\)-th individual in generation \(Z_n\), then we have</li>
</ol>
<p>$$Z_{n+1}=\sum_{i=1}^{Z_n}X_{n+1,i},$$</p><p>where \(\{X_{n+1,i}\}\) is a sequence of i.i.d. random variables with the common offspring distribution \(P(X=k)\) defined above, and they are independent of \(Z_0, Z_1, \dots, Z_n\).</p>
<ol start="4">
<li>The mean number of offspring per individual, \(\mu\), is a crucial parameter for determining the long-term behavior of the process, where we have</li>
</ol>
<p>$$\mu=E[X]=\sum_{k=0}^\infty{k \cdot p_k} = 0 \cdot p + 2 \cdot (1 - p) = 2 (1 - p).$$</p><p>This process is a <strong>discrete-time Markov chain</strong> on the state space of non-negative integers \(\mathbb{Z}_{\geq 0}\), with the state \(Z_n=0\) being an absorbing state (extinction).</p>
<h2 id="heading-extinction-time-aka-depth-of-trees">Extinction Time aka Depth of Trees</h2>
<p>Now we try to compute the extinction time of the branching process, <em>i.e.</em>, the depth of the binary tree. The extinction time of the Galton-Watson process is defined as the first generation \(N\) at which the population size drops to zero. Formally, we have that</p>
<p>$$N=\inf\{n \mid Z_n = 0\},$$</p><p>where \(\inf\emptyset = \infty\). If the population never becomes extinct, <em>i.e.</em>, for all \(n\) we have \(Z_n&gt;0\), then \(N=\infty\).</p>
<p>The overall probability of ultimate extinction, \(\pi\), is the probability that the extinction time is finite:</p>
<p>$$\pi = P(N&lt;\infty) = P(\exists n \geq 1 \ . Z_n = 0).$$</p><p>For the specific branching process under our settings, the value of \(\pi\) is the smallest non-negative root of the equation</p>
<p>$$s=G(s),$$</p><p>where \(G(s)\) is the <strong>probability generating function</strong> (PGF) of the offspring distribution. More specifically, our PGF for the offspring is</p>
<p>$$G(s)=E[s^X]=\sum_{k=0}^\infty{s^kp_k}=s^0 \cdot p + s^2 \cdot (1 - p) = p + (1-p)s^2.$$</p><p>The extinction probability \(\pi\) is the smallest non-negative solution to</p>
<p>$$\pi = p + (1-p) \pi^2.$$</p><p>In general, calculating \(P(N=k)\) requires iterating the PGF. Let \(G_k(s)\) be the PGF for \(Z_k\), and we have</p>
<p>$$G_k(s)=E[s^{Z_k}]=G(G_{k-1}(s)),$$</p><p>for each \(k&gt;1\). And for \(k=1\), we have that \(Z_0=1\). Hence, \(G_0(s)=E[s^1]=s\).</p>
<p>With the definition of the PGFs for each generation, we have the probability of extinction <strong>by</strong> generation \(k\):</p>
<p>$$P(N \leq k) = P(Z_k=0) = G_k(0).$$</p><p>And the probability of extinction at generation \(k\) should be:</p>
<p>$$P(N=k)=P(Z_k=0) - P(Z_{k-1}=0) = G_k(0) - G_{k-1}(0).$$</p><p>We calculated each \(P(N=k)\) for each \(k \in [1, 40]\) under different distributions of \(X\), where \(p\) is set to one of the values in \(\{0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9\}\). The figure below shows the distribution of \(P(N=k)\) and that of \(P(N \leq k)\), as well as the ultimate distinction probability \(\pi\).</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1763350677953/ab009f64-6f7e-46fa-95ff-4098079afe0c.png" alt class="image--center mx-auto" /></p>
<p>From the figure, we can see that when \(p&lt;0.5\), <em>i.e.</em>, \(\mu=2(1-p)&gt;1\), extinction may never occur. This corresponds to the supercritical case of a branching process. Hence, to always sample an AST with a finite depth, we should require that \(p \geq 0.5\).</p>
<p>The probability mass function \(P(N=k)\) graph shows that for all considered values of \(p\), the distribution is heavily concentrated towards the left, resulting in a rapidly decaying curve. Specifically, a very high proportion of the extinction events occur within the first few generations. This rapid decay means that the extinction time \(N\) is typically too short, <em>i.e.</em>, the sampled ASTs are too shallow. An ideal distribution would spread the probability mass more evenly over a moderate range of generations to achieve an intermediate extinction time, but the current process’s inherent tendency is to extinguish almost immediately or survive indefinitely. In summary, the distribution of \(N\) is too steep, and we need a process that results in a slower decay of \(P(N=k)\), shifting the probability mass to higher generation numbers.</p>
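<p>For reference, here is a minimal sketch of the computation of \(P(N=k)\) above, using the recursion \(q_k = G(q_{k-1})\) with \(q_0 = G_0(0) = 0\); the function and variable names are illustrative:</p>
<pre><code class="lang-python">def extinction_time_pmf(p, max_k=40):
    """P(N = k) for k = 1..max_k, under the offspring distribution
    P(X=0) = p, P(X=2) = 1 - p."""
    q_prev = 0.0  # q_0 = G_0(0) = 0, since Z_0 = 1
    pmf = []
    for _ in range(max_k):
        q = p + (1 - p) * q_prev ** 2  # q_k = G(q_{k-1})
        pmf.append(q - q_prev)         # P(N = k) = q_k - q_{k-1}
        q_prev = q
    return pmf
</code></pre>
<p>Summing the returned list gives \(P(N \leq 40)\), which approaches the ultimate extinction probability \(\pi\) as the horizon grows.</p>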
<h2 id="heading-going-controlled-a-non-homogeneous-alternative">Going Controlled: A Non-Homogeneous Alternative</h2>
<p>Okay. So how should we control the branching process? Ideally, we want the tree (population) to expand in width in the first few generations, and as it grows, we hope that the probability of expansion decreases. To achieve this desired behavior, we introduce a non-homogeneous Galton-Watson branching process. Instead of having a fixed offspring distribution \(P(X=k)\) for all generations, the probability of an individual leaving zero offspring changes over time. In our binary branching model, where an individual either produces 0 or 2 offspring, we define the probability of leaving zero offspring, \(p_k=P(X_k=0)\), as a smooth function of the generation number \(k\). We propose the following functional form for this probability:</p>
<p>$$p_k=\tanh(t \cdot k),$$</p><p>where \(k\) is the current generation index and \(t\) is a positive constant that acts as a tuning parameter that controls the rate at which \(p_k\) increases. This leads to the offspring probability PGF for generation \(k\) turning into</p>
<p>$$G_k(s)=p_k + (1 - p_k)s^2.$$</p><p>And since now we are dealing with a non-homogeneous process, the iterative production of the PGF for the population size \(Z_n\) at the \(n\)-th generation is no longer a simple repetitive composition of the same PGF for each generation. Instead, we have the iterative relation between \(G^{(n)}(s)=E[s^{Z_n}]\) and \(G^{(n-1)}(s)=E[s^{Z_{n-1}}]\):</p>
<p>$$\begin{align} G^{(n)}(s) &amp;= E[s^{Z_n}] = E[E[s^{Z_n}\mid Z_{n-1}]] = E[E[s^{\sum_{i=1}^{Z_{n-1}}{X_{n,i}}}]] \\ &amp;= E[(E[s^{X_n}])^{Z_{n-1}}] = E[(G_n(s))^{Z_{n-1}}] = G^{(n-1)}(G_n(s)). \end{align}$$</p><p>Similarly to the homogeneous branching process, we also have \(G^{(0)}(s)=s\), since \(Z_0=1\). And for each \(k\geq1\), we have that</p>
<p>$$G^{(n)}(s) = G_1(G_2(G_3(\dots G_n(s)\dots))).$$</p><p>With the definition of \(G^{(n)}(s)\), we have the probability of extinction <strong>by</strong> generation \(k\)</p>
<p>$$P(N \leq k) = P(Z_k = 0) = G^{(n)}(0),$$</p><p>and the probability of extinction <strong>at</strong> generation \(k\)</p>
<p>$$P(N=k)=P(Z_k=0)-P(Z_{k-1}=0) = G^{(n)}(0) - G^{(n-1)}(0).$$</p><p>With the constant parameter \(t\) selected from \(\{0.1, 0.2, 0.3, 0.4, 0.5, 1.0\}\), we compute the probability distribution of the extinction time \(N\). See the following figure.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1763419054877/e21b0439-2504-4955-b603-fd440aafe73a.png" alt class="image--center mx-auto" /></p>
<p>From the left graph, it can be seen that the distribution is more concentrated around a moderate number of generations when \(t&lt;0.5\). From the right graph, we observe that no matter what value is assigned to \(t\), the ultimate extinction probability is always \(1\). Hence, using \(\tanh\) in the offspring distribution is an appropriate choice for the binary branching process in our case.</p>
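<p>The corresponding computation for the non-homogeneous case is only slightly more involved: since the innermost PGF now changes with \(k\), \(G^{(k)}(0)\) is recomputed from the inside out for each \(k\). A minimal sketch, again with illustrative names:</p>
<pre><code class="lang-python">import math

def nh_extinction_cdf(t, max_k=40):
    """P(N &lt;= k) = G^{(k)}(0) for k = 1..max_k, with p_j = tanh(t * j)."""
    cdf = []
    for k in range(1, max_k + 1):
        s = 0.0
        for j in range(k, 0, -1):          # compose G_1(G_2(...G_k(0)...))
            p_j = math.tanh(t * j)
            s = p_j + (1 - p_j) * s ** 2   # apply G_j, innermost first
        cdf.append(s)
    return cdf

def nh_extinction_pmf(t, max_k=40):
    cdf = [0.0] + nh_extinction_cdf(t, max_k)  # prepend P(N &lt;= 0) = 0
    return [cdf[k] - cdf[k - 1] for k in range(1, max_k + 1)]
</code></pre>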
<h2 id="heading-summary">Summary</h2>
<p>In this blog, the problem of sampling random ASTs for a custom language was modeled as a branching process to control the resulting tree depth. The AST generation uses two actions: <strong>Mask-Down</strong> (growing the tree depth/width, corresponding to producing 2 offspring, <em>NT</em>) and <strong>Unmask</strong> (terminating a branch, corresponding to 0 offspring, <em>T</em>). The initial attempt used a <strong>homogeneous Galton-Watson branching process</strong> where the probability of terminating a branch, \(p\), was constant across all generations. The offspring distribution was \(P(X=0)=p\) and \(P(X=2)=1-p\). The analysis showed that to guarantee a finite-depth AST, the mean number of offspring \(\mu = 2(1-p)\) must be \(\leq 1\), requiring \(p \geq 0.5\). However, this homogeneous process resulted in a tree depth probability distribution \(P(N=k)\) that decayed too rapidly, meaning most sampled ASTs were very shallow (small depth \(N\)).</p>
<p>To achieve a more desirable, moderate AST depth, a <strong>non-homogeneous Galton-Watson branching process</strong> was introduced. In this model, the termination probability is no longer constant but changes with the generation index \(k\): \(p_k = P(X_k=0) = \tanh(t \cdot k)\), where \(t\) is a tuning parameter. This function ensures that in early generations (small \(k\)), \(p_k\) is low, encouraging the tree to grow wide, and as the tree gets deeper (large \(k\)), \(p_k\) increases, making termination more likely. The new process utilizes the iterative PGF composition \(G^{(n)}(s) = G^{(n-1)}(G_n(s))\) to calculate \(P(N=k) = G^{(k)}(0) - G^{(k-1)}(0)\).</p>
<p>The results confirmed that the non-homogeneous process, particularly for small values of \(t\), <em>e.g.</em>, \(t&lt;0.5\), spreads the probability mass over a larger range of depths, thus sampling ASTs with more moderate depths. Furthermore, the \(\tanh\) function guarantees that the ultimate extinction probability is \(1\), meaning all generated ASTs will eventually have a finite depth. Via this approach, we eventually arrive at a controlled branching process.</p>
]]></content:encoded></item><item><title><![CDATA[Just Created a Tree-sitter Parser: Now What?]]></title><description><![CDATA[The authors of Tree-sitter have provided detailed documentation on creating your own language parsers. So, following the tutorial, you have meticulously crafted your own grammar using tree-sitter-cli. You ran tree-sitter generate as the docs instruct...]]></description><link>https://blog.brandonw3612.com/just-created-a-tree-sitter-parser-now-what</link><guid isPermaLink="true">https://blog.brandonw3612.com/just-created-a-tree-sitter-parser-now-what</guid><category><![CDATA[tree-sitter]]></category><category><![CDATA[programming languages]]></category><category><![CDATA[compiler]]></category><category><![CDATA[parser]]></category><category><![CDATA[Python]]></category><dc:creator><![CDATA[Brandon Wong]]></dc:creator><pubDate>Fri, 24 Oct 2025 20:34:03 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1761330488121/b85939fc-4f98-437b-a419-18196b9d8ba1.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>The authors of <a target="_blank" href="https://tree-sitter.github.io/tree-sitter/">Tree-sitter</a> have provided detailed <a target="_blank" href="https://tree-sitter.github.io/tree-sitter/creating-parsers/index.html">documentation</a> on creating your own language parsers. So, following the tutorial, you have meticulously crafted your own grammar using <code>tree-sitter-cli</code>. You ran <code>tree-sitter generate</code> as the docs instructed and this generated the C code required to parse your language. You wrote a simple code snippet and tested parsing it with <code>tree-sitter parse path/to/your/code</code>. The result matches perfectly with your expectations. Everything is going under control.</p>
<p>Now it's time for the next step. You can't wait to integrate the parser into your project written in Python/C++/Java/... Tree-sitter provides bindings for these languages, making it easy to work with, and a list of known parsers is readily available for you to explore and use. But what about custom parsers? You go back to the documentation, and it looks like the story ends when the parser is created. Wait, what?</p>
<p>Actually, you're not alone in this. An <a target="_blank" href="https://github.com/tree-sitter/tree-sitter/issues/643">issue</a> on Tree-sitter's GitHub repository raised a similar question, but I think they were overcomplicating things. Let's keep it simple and see what the CLI has generated for us:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1763350059565/5007666d-f356-42f6-9053-d7235e40bce9.png" alt /></p>
<p>It looks like the <code>bindings</code> folder contains bindings for our parser in different languages. However, when we look closer at the subfolders, these bindings are just empty frameworks without the actual parser. In the root directory of the parser project, we also find some files like <code>setup.py</code>. Since I'm working with Python, I opened the <code>setup.py</code> script and found these lines in the <code>setup</code> section:</p>
<pre><code class="lang-python">setup(
    packages=find_packages(<span class="hljs-string">"bindings/python"</span>),
    package_dir={<span class="hljs-string">""</span>: <span class="hljs-string">"bindings/python"</span>},
    package_data={
        <span class="hljs-string">"tree_sitter_imp"</span>: [<span class="hljs-string">"*.pyi"</span>, <span class="hljs-string">"py.typed"</span>],
        <span class="hljs-string">"tree_sitter_imp.queries"</span>: [<span class="hljs-string">"*.scm"</span>],
    },
    ext_package=<span class="hljs-string">"tree_sitter_imp"</span>,
    ext_modules=[
        Extension(
            name=<span class="hljs-string">"_binding"</span>,
            sources=[
                <span class="hljs-string">"bindings/python/tree_sitter_imp/binding.c"</span>,
                <span class="hljs-string">"src/parser.c"</span>,
            ],
            define_macros=[
                (<span class="hljs-string">"PY_SSIZE_T_CLEAN"</span>, <span class="hljs-literal">None</span>),
                (<span class="hljs-string">"TREE_SITTER_HIDE_SYMBOLS"</span>, <span class="hljs-literal">None</span>),
            ],
            include_dirs=[<span class="hljs-string">"src"</span>],
            py_limited_api=<span class="hljs-keyword">not</span> get_config_var(<span class="hljs-string">"Py_GIL_DISABLED"</span>),
        )
    ],
    cmdclass={
        <span class="hljs-string">"build"</span>: Build,
        <span class="hljs-string">"build_ext"</span>: BuildExt,
        <span class="hljs-string">"bdist_wheel"</span>: BdistWheel,
        <span class="hljs-string">"egg_info"</span>: EggInfo,
    },
    zip_safe=<span class="hljs-literal">False</span>
)
</code></pre>
<p>Isn't this exactly what we need to combine the parser with the Python language binding? Just like in the setup scripts of general packages, the <code>build_ext</code> command is used to compile C/C++ extension modules for the package. It's an important part of the setuptools build process and is often used with the <code>--inplace</code> option. So, I set up the Python virtual environment and ran <code>python setup.py build_ext --inplace</code>. And there we have it! A dynamic library file named <code>_binding.abi3.so</code> has been built and copied into the Python language binding folder.</p>
<p>Now, let's test the functionality of the package. In the <code>bindings/python/tests</code> folder, there's a simple test script called <code>test_binding.py</code>. We install all the dependencies and run it, and BAM! It works. From now on, my <code>tree_sitter_imp</code> package folder can be copied to my project and I just need to build a parser like this:</p>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> tree_sitter <span class="hljs-keyword">import</span> Language, Parser
<span class="hljs-keyword">import</span> tree_sitter_imp

my_parser = Parser(Language(tree_sitter_imp.language()))
</code></pre>
<p>And, that's it. Have fun with your custom parser ;-)</p>
<h2 id="heading-yet-another-hack">…Yet Another Hack</h2>
<p>In fact, my initial solution for this issue was veeeeeery different.</p>
<p>Remember the <code>generate</code> command creates the C code for the parser, right? This means we can always build the parser as a library. Tree-sitter provides <a target="_blank" href="https://tree-sitter.github.io/tree-sitter/cli/build.html">a tool for this</a>. Using the command <code>tree-sitter build</code>, a dynamic library is built in the root folder of the parser project. I'm working on macOS, so it's named <code>imp.dylib</code>. If you're on Linux or Windows, the name will end with <code>.so</code> or <code>.dll</code>.</p>
<p>To build a <code>Parser</code> object, you need to pass a <code>Language</code> object to the initializer, and for custom language parsers, the initializer for <code>Language</code> requires a pointer returned by the language function, like <code>tree_sitter_my_lang()</code>. We only need to load the dynamic library into our project and get a handle to the function. How do we deal with dynamic libraries? Different answers for different languages. In Python, we use the <code>ctypes</code> package. And this is how I create a wrapper for the library:</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> ctypes
<span class="hljs-keyword">from</span> tree_sitter <span class="hljs-keyword">import</span> Language, Parser

LANG_NAME = <span class="hljs-string">'imp'</span>
LIB_PATH = <span class="hljs-string">'lib/imp.dylib'</span>

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">imp_parser</span>() -&gt; Parser:</span>
    lib = ctypes.CDLL(LIB_PATH)
    lang_func = getattr(lib, <span class="hljs-string">'tree_sitter_imp'</span>)
    lang_func.restype = ctypes.c_void_p
    <span class="hljs-keyword">return</span> Parser(Language(lang_func()))
</code></pre>
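<p>As a quick sanity check, here is a hedged usage sketch; the sample input is arbitrary, and note that <code>Parser.parse</code> expects the source code as <code>bytes</code>:</p>
<pre><code class="lang-python">parser = imp_parser()
tree = parser.parse(b"x + 1 / y")  # Parser.parse takes the source as bytes
print(tree.root_node)              # inspect the root node of the parse tree
</code></pre>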
<p>You just need to put the library under the <code>lib</code> subfolder and import the wrapper; then calling <code>imp_parser()</code> directly gets you a parser for the language. Again, have fun! But I suggest using the first approach. Much more elegant, isn't it? 🤪</p>
]]></content:encoded></item></channel></rss>