AI day in retroland. Prolog on the Commodore VIC-20! (Needs expanded RAM)

Today’s post is AI-heavy! AI as in OCR (“optical character recognition”). We will OCR (“optical character recognize”) a hex listing for a Prolog interpreter (which used to be thought of as an “AI language”) for the Commodore VIC-20! (As a bonus, some small parts of the tools I made to verify the OCR transcription were written by ChatGPT.)

As you may have heard before, OCRing stuff is error-prone. Ls and Is and 1s being mixed up makes natural language texts annoying to read, and program listings almost useless, because you’ll spend a long time trying to find the error. Why does this take a long time? Because our eyes (and attached circuitry) don’t notice tiny imperfections in a sea of details. However, we are quite good at noticing things that look completely different from the surrounds.

With hex OCR, we really only have to worry about 16 different classes (types of digit). This makes it relatively easy to verify if our OCR is correct (and perform fixes), because we can take our OCR’d digits and temporarily (while remembering their original position) display them all, sorted by class. Like this:

We can easily see that all these are indeed 3s, 4s, and 5s.

Or like this:

We can easily see that there’s a 0 in our list of 6s and a 9 in our list of 7s.

(Note: occasionally, OCR tools will turn a single character into two characters, or the other way round. That kind of problem will require manual edits.)

I created two tools, a segmentation tool, and the above verification tool, and described them a little more in this post: OCRing hex dumps (or other monospace text) and verifying the result. The tools themselves are at https://blog.qiqitori.com/ocr/monospace_segmentation_tool/ and https://blog.qiqitori.com/ocr/verification_tool/.

This is the scan in question: https://archive.org/details/Io19833/page/n342/mode/1up. Here’s the rather good-looking first page:

I/O アイ・オー 1983年3月号

For the original OCR, I used a program called ProgramListOCR. The program supports OCRing hex dumps. This program requires that you touch up input images in (e.g.) Gimp before loading them. It’s not difficult, and the program’s README describes what needs to be done. Unfortunately, this process removes a small amount of detail from the image, making it harder to distinguish between, e.g., Bs and 8s. And unfortunately, I believe the program only runs in Windows. Here’s a screenshot of the program running:

ProgramListOCR made 142 digit mistakes. The hex dump consisted of 7310 digits, so the overall error rate is 1.943%, or the accuracy is 98.057%.
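For anyone who wants to redo this arithmetic on their own OCR run, here is a minimal sketch. It assumes you have the OCR output and a verified transcription as two equally long strings of hex digits; the function name is made up for illustration.

```python
def ocr_error_stats(ocr_digits, reference_digits):
    """Return (number of differing digits, error rate in percent)."""
    if not reference_digits or len(ocr_digits) != len(reference_digits):
        raise ValueError("inputs must be non-empty and equally long")
    pairs = zip(ocr_digits.upper(), reference_digits.upper())
    mistakes = sum(a != b for a, b in pairs)
    return mistakes, 100.0 * mistakes / len(reference_digits)


# Figures from this post: 142 wrong digits out of 7310.
error_rate = 100.0 * 142 / 7310
print(f"error rate {error_rate:.3f}%, accuracy {100.0 - error_rate:.3f}%")
```

Note that this only works for substitution errors. If the OCR tool split or merged characters, the two strings no longer line up and you would have to align them first.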

How to download and run Prolog

In order to run this on your VIC-20 emulator, you need to set it to have an 8K memory expansion. Then you need to load the binary data into RAM; the starting address is $2204 (hexadecimal). In VICE, you can add the memory expansion in this config window:

Select at least “Block 1 (8KiB at $2000-$3FFF)”. PAL/NTSC etc. do not matter.

To load the binary data into address $2204 and beyond, start the monitor (Alt+H), and then I wish it’d work with ‘load “/path/to/prolog.bin” 0 2204’. But for some reason that doesn’t work; the first few bytes are garbled and the rest isn’t aligned correctly. If you have this issue, try the other file and ‘load “/path/to/prolog_prefixed_with_zeros.bin” 0 2202’. Execute “m 2200” in the monitor to see if VICE loaded your file into the correct address. The following is an example of a successful load:

2200-2203 don’t matter, 2204- should be 78 a9 00 8d, etc.
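If you want to build or double-check the padded file yourself, here is a small sketch. It assumes that prolog_prefixed_with_zeros.bin is simply prolog.bin with two zero bytes in front (inferred from the two load addresses, $2202 and $2204). The helper name and the dummy data are made up for illustration.

```python
LOAD_ADDRESS = 0x2204          # where the interpreter's first byte must end up
PADDED_LOAD_ADDRESS = 0x2202   # load address used with the zero-prefixed file
EXPECTED_START = bytes.fromhex("78a9008d")  # what "m 2200" shows from $2204 on


def pad_with_zeros(data):
    """Prepend zero bytes so that data still starts at LOAD_ADDRESS."""
    return bytes(LOAD_ADDRESS - PADDED_LOAD_ADDRESS) + data


# Dummy stand-in for the real binary: just the first four bytes.
padded = pad_with_zeros(EXPECTED_START)
print(padded.hex())                  # 000078a9008d

# The SYS entry point from this post, in hex, and a sanity check that it
# lies inside the 8 KiB expansion block at $2000-$3FFF.
print(f"SYS 11445 = ${11445:04X}")   # SYS 11445 = $2CB5
print(0x2000 <= 11445 <= 0x3FFF)     # True
```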

Then you close the monitor and type “SYS 11445” in the BASIC prompt, and you should get something like this:

Having fun with Prolog

There are various sample programs in the magazine. Note that the Prolog interpreter sometimes gives you a question mark prompt, and sometimes a hyphen prompt. You have to delete these manually by pressing backspace (Delete), depending on what you want to do! Let’s start with this short program:

m(*a.*x)*a*x
m(*b.*x)*a(*b.*y)
-m*x*a*y
p()()
p*x(*a.*y)
-m*x*a*z
-p*z*y
-;
?p(1 2 3)*x
-pr*x
-m;
***answer***
(1 2 3)
(1 3 2)
(2 1 3)
(2 3 1)
(3 1 2)
(3 2 1)
!!fail!!
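If the syntax is hard to read: my understanding is that m picks one element out of a list and hands back the rest, and p builds a permutation by picking an element and permuting what is left. Here is a rough Python sketch of that reading (it is an interpretation of the listing, not something taken from the magazine), with the matching clauses in the comments:

```python
def select(items):
    """Yield (element, remaining_items) pairs, like the two m clauses."""
    if not items:
        return
    head, tail = items[0], items[1:]
    yield head, tail                      # m(*a.*x)*a*x
    for chosen, rest in select(tail):     # m(*b.*x)*a(*b.*y)  -m*x*a*y
        yield chosen, [head] + rest


def permutations(items):
    """Yield permutations of items, like the two p clauses."""
    if not items:
        yield []                          # p()()
        return
    for chosen, rest in select(items):    # -m*x*a*z
        for tail in permutations(rest):   # -p*z*y
            yield [chosen] + tail


for permutation in permutations([1, 2, 3]):
    print(permutation)
```

The generators produce the permutations in the same order as the interpreter's answers above.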

The next program (actually the first in the magazine, and easiest) is a program that tells you whether the density of blocks 1-4 is high or low, or unknown:

That’s the data and the functions, er, I mean predicates.
weight block1 heavy
weight block2 heavy
weight block3 light
weight block4 light
bulk block1 large
bulk block3 large
bulk block2 small
bulk block4 small
density *x high
-weight *x heavy
-bulk *x small
density *x low
-weight *x light
-bulk *x large
density *x ???
-weight *x heavy
-bulk *x large
density *x ???
-weight *x light
-bulk *x small
-;
?

I believe I speak for us all when I say, the syntax looks a bit weird? Anyway, the first few things are the data, er, I mean facts. Then you get a function, er, predicate “signature”, and below the predicate signature you get the actual… predicate definition (the lines that start with a hyphen). (Predicates may also be called rules.) Want to finish up the current rules and start a new one with a different signature? Just backspace away the hyphen. When you’re all done, type a semicolon, and you’ll be back at the ‘?’ prompt. Now we can run queries!

In the screenshot, we first ask which blocks have a high density. The answer is BLOCK2!

Then we ask it for the density of BLOCK3 and ask it for the reason using PROOF:

?DENSITY BLOCK3 *X
-PROOF;

And the answer is:

/DENSITY BLOCK3 LOW ,BECAUSE*** WEIGHT BLOCK3 LIGHT ,& BULK BLOCK3 LARGE:
/Q.E.D.
DENSITY BLOCK3 LOW...IS TRUE:
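For readers more at home in an imperative language, here is a rough Python mirror of the facts and rules above. It is only a sketch of what the rules say, and it does no backtracking or proof output like the interpreter does; the names are made up.

```python
# Facts, copied from the listing.
WEIGHT = {"block1": "heavy", "block2": "heavy", "block3": "light", "block4": "light"}
BULK = {"block1": "large", "block2": "small", "block3": "large", "block4": "small"}

# Rules in listing order: (density, required weight, required bulk).
RULES = [
    ("high", "heavy", "small"),
    ("low", "light", "large"),
    ("???", "heavy", "large"),
    ("???", "light", "small"),
]


def density(block):
    """Return the density from the first rule whose conditions both hold."""
    for result, weight, bulk in RULES:
        if WEIGHT[block] == weight and BULK[block] == bulk:
            return result
    return None


print([block for block in WEIGHT if density(block) == "high"])  # ['block2']
print(density("block3"))                                        # low
```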

OCRing hex dumps (or other monospace text) and verifying the result

Summary: Segmentation tool and OCR verification tool. You can use these tools to either verify an existing OCR’d hex dump, or use them to run your own OCR. (Which isn’t hard! You can probably get ChatGPT to produce a probably working Python script using PyTorch to learn the digits, and easily get 97% (or so) accuracy. Maybe something along the lines of, “Write a Python script that uses PyTorch to train recognition of something like MNIST, except there are 16 classes, not 10. The recognition should use convolutional layers. Input images are PNG files. Labels are in a text file.” (I just tried and the result looks plausible.))

Why hex dumps anyway? Because in the 1980s computer magazines sometimes included printed hex dumps of programs. But that’s just how I got motivated to write these tools. More on that in this post.

If you are familiar with basic image recognition concepts, you may know that detecting hand-written digits is generally considered to be a very easy task, the “hello world” of AI image recognition even. (Didn’t know this? Maybe search for “MNIST dataset”)

If recognizing handwritten digits is considered so easy, recognizing printed digits should be even easier, no? The answer is “yes” and “no”, because I left out some information above. The MNIST dataset consists of images that contain exactly one digit. OCR, on the other hand, requires segmentation. In general, recognizing typed letters if you have them in a nicely cropped single image is quite easy. (Except for letters that look very similar or even identical, of course.) Is segmentation an easy task? Well, there are all kinds of layouts out there. If you want to know more about segmentation, Andrew Ng explained the basics in this and the following few videos: https://www.youtube.com/watch?v=CykIW9hFK24&list=PLLssT5z_DsK-h9vYZkQkYNWcItqhlRJLN&index=108. These videos are part of Andrew Ng’s Machine Learning course on coursera.org, but I can’t find the specific lecture that contained this bit. (tl;dr: basically, you have a pipeline with multiple stages: first you detect regions that vaguely look like text, then a stage that detects if you have a single character or more than one character, and finally a stage that can recognize single characters.)

Performing segmentation on hex dumps and other monospace text is quite easy. However, getting the segmentation wrong can ruin the OCR. Either hardly anything will be recognized, or things will be jumbled up. I played around with Tesseract and a couple other OCR systems but wasn’t able to get good results on hex dumps. Hex dumps have the additional benefit that there are only 16 symbols that need to be recognized. One tool that worked pretty well was ProgramListOCR (https://github.com/eighttails/ProgramListOCR). I think it was over 95% accurate with my input images. If it could output the segmentation too, it would be even better, in my opinion.

In this blog post I’m going to describe the tools I linked to above (Segmentation tool and OCR verification tool) and how we can use these tools to get a perfect OCR scan of a hex dump. Because let’s face it… A 99% correct hex dump isn’t all that useful, unless you enjoy sending old CPUs off the rails, or playing spot the difference.

Text segmentation/image tiling tool

The segmentation tool sort of looks like this (at the time of this writing):

Top of page (let’s skip the middle)
Bottom

What you can do here is: select an image from a file, specify the number of columns and rows, adjust the rows and columns using the buttons on the right (and then clicking somewhere in the image to make that row or column smaller or larger), export to tiles. The adjustment process is best done at high zoom levels (use Ctrl+scroll wheel to zoom). You can also choose to skip certain columns when exporting. You can use the keyboard to do most things. (Cursor keys: move around, space: use current tool at cursor position, T: toggle column export tool, x/X: add/remove X offset tool, y/Y: add/remove Y offset tool.) The tiles will be output into data URLs in the text area at the bottom of the page. You can convert the data URLs back into files using the given shell code snippet. Also reproduced here:

# put contents of clipboard into a file:
xclip -o > data.txt

# convert data URLs in data.txt to PNG files:
i=0; while read -r line; do output_file=$(printf "%05d.png" $i); echo "${line:22}" | base64 -d > "$output_file"; i=$((i+1)); done < data.txt

You can of course also tile into single characters. You’ll just have to fiddle a little more with the offset tools.
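To give an idea of what tiling a monospace image boils down to, here is a minimal sketch that computes the tile rectangles for a regular grid with optional per-column and per-row adjustments. This is not the tool's actual code. The function names and the way the offsets work (extra pixels added to one column or row) are assumptions for illustration.

```python
def _edges(step, count, offsets):
    """Cumulative cell boundaries: each cell is `step` wide plus its offset."""
    edges = [0.0]
    for index in range(count):
        edges.append(edges[-1] + step + offsets.get(index, 0))
    return edges


def tile_boxes(width, height, columns, rows, x_offsets=None, y_offsets=None):
    """Return (left, top, right, bottom) pixel boxes, row by row.

    x_offsets / y_offsets are hypothetical {index: extra_pixels} adjustments
    that widen (or, if negative, narrow) a single column or row.
    """
    x_edges = _edges(width / columns, columns, x_offsets or {})
    y_edges = _edges(height / rows, rows, y_offsets or {})
    return [
        (round(x_edges[c]), round(y_edges[r]), round(x_edges[c + 1]), round(y_edges[r + 1]))
        for r in range(rows)
        for c in range(columns)
    ]


boxes = tile_boxes(100, 40, 4, 2, x_offsets={0: 2})
print(boxes[:2])  # [(0, 0, 27, 20), (27, 0, 52, 20)]
```

Widening one column pushes every column to its right along with it, which is roughly what you want when the grid drifts across a scanned page.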

OCR verification tool

Here are two screenshots that may help to get some intuition on how to use this tool:

Top of page
Bottom

This tool (at the time of this writing) expects as input 1) images as data URLs, to be pasted into the textarea at the top of the page, and 2) the predicted labels corresponding to each image.
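If your tiles already exist as PNG files (for example from a different segmentation step), something like the following sketch should turn them back into data URLs, one per line. It assumes the same "data:image/png;base64," prefix that the shell snippet above strips off; I haven't checked whether the tool accepts other variants. The function names are made up.

```python
import base64
from pathlib import Path

PREFIX = "data:image/png;base64,"  # 22 characters, hence ${line:22} in the shell snippet


def png_to_data_url(png_bytes):
    """Encode raw PNG bytes as a data URL."""
    return PREFIX + base64.b64encode(png_bytes).decode("ascii")


def data_url_to_png(data_url):
    """Decode a PNG data URL back into raw bytes."""
    if not data_url.startswith(PREFIX):
        raise ValueError("not a PNG data URL")
    return base64.b64decode(data_url[len(PREFIX):])


def tiles_to_lines(directory):
    """One data URL per line for all *.png files in directory, sorted by name."""
    paths = sorted(Path(directory).glob("*.png"))
    return "\n".join(png_to_data_url(path.read_bytes()) for path in paths)


signature = b"\x89PNG\r\n\x1a\n"  # just the PNG file signature, as dummy data
url = png_to_data_url(signature)
print(len(PREFIX), data_url_to_png(url) == signature)  # 22 True
```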

The tool then displays the input images sorted into their classes for easy verification by a human. ;) And this is pretty easy for humans, because there are just 16 classes and the human eye is very sensitive to objects that don’t look like the surrounding objects. Here’s another screenshot to demonstrate that it should be easy to find things that look out of place:

The 0 surrounded by 6s and the 9 in the cluster of 7s should be pretty easy to spot. (It is also possible that there are Bs surrounded by 8s, but that’s a different topic.)

The web page allows you to drag the images around to put them in the correct category, and to then reconstruct the labels, taking the fixes into account.

Dragging and dropping 9s mis-identified as 7s.
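The idea of sorting by class, fixing, and then putting everything back in its original position can be sketched in a few lines. This is not the web page's code, just an illustration with made-up function names:

```python
from collections import defaultdict


def group_by_class(labels):
    """Map each class to the original positions that carry that label."""
    groups = defaultdict(list)
    for position, label in enumerate(labels):
        groups[label].append(position)
    return dict(groups)


def move(groups, position, old_label, new_label):
    """The 'drag and drop': move one position from one class to another."""
    groups[old_label].remove(position)
    groups.setdefault(new_label, []).append(position)


def reconstruct(groups, total):
    """Rebuild the label string from the (possibly corrected) groups."""
    labels = [None] * total
    for label, positions in groups.items():
        for position in positions:
            labels[position] = label
    if None in labels:
        raise ValueError("some positions have no label")
    return "".join(labels)


ocr = "78A7008D"           # pretend the digit at position 3 is really a 9
groups = group_by_class(ocr)
move(groups, 3, "7", "9")
print(reconstruct(groups, len(ocr)))  # 78A9008D
```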

Here’s a more real-world example, with unpolished input images. (If you invest a couple minutes to add/remove offsets in the segmentation tool, you should get slightly better images than this.)

This is one page (around 1/3) of the entire hex dump

Raspberry Pi Pico implementation of the YM3012 DAC (mono)

Introduction

The YM3012 IC is a DAC that requires two external op amp circuits and turns a serial digital audio signal consisting of a 10-bit mantissa and 3-bit exponent into an analog signal.

I am currently investigating a fault in an audio module (SFG-01) for certain MSX computers (mostly Yamaha). This audio module is pretty capable and sports a YM2151 FM audio synthesis chip and comes with MIDI input and output ports, a connector for a digital piano keyboard, and software to use the keyboard of course. (I actually never checked if the software is in the module or in the computer.) See this for more information on the SFG-01: https://www.msx.org/wiki/Yamaha_SFG-01.

The fault becomes apparent as soon as two keys are pressed at the same time on the digital piano keyboard. You get a kind of growling/distorted effect. The audio doesn’t sound clean. (Head to the video section below to hear what it sounds like.) My first thought was, that sounds like an analog problem. Aw, I wish. I replaced a couple capacitors without any improvement whatsoever. The removed capacitors all tested fine out-of-circuit, too. A few people said it could be a problem with the op amps. One (relatively) quick way to check if that is the case, is to replace the op amps and try again. But why do it the quick and simple way (with possibly nothing to show at the end) if you can do it the slow and complicated way (with maybe something to show at the end)?

YM3012 pinout

The Raspberry Pi Pico is very good at IO. Not only do we have a lot of pins, but we can read from and write to them very, very fast. However, we aren’t going to go that fast today actually. Neither are we going to be using a lot of pins. In order to build a DAC, we need to read the CLOCK φ1, SD (DATA) and SAM1 and/or SAM2 pins. And then we need output, which in my case is a single pin outputting PWM audio. (It sounds okay, probably not exactly Hi-Fi.) My implementation only reads SAM1 and only outputs a single channel, completely discarding the other channel. It wouldn’t be too hard to get the second channel to work too — the Pico is a dual-core jobby after all, so you could just run the same code on the second core and it’d work. (As there isn’t really a lot of post-processing going on at all, you could most likely even get it to work with just a single core, but I haven’t tried.)

So, in order to test whether our DAC, one of the op amp circuits, or the filter circuits are misbehaving, we just need to hook up our Raspberry Pi Pico and check if we’re getting the faulty audio there too. If yes, the DAC is innocent. If no, the DAC or related circuitry would be implicated.

PWM audio

Researching PWM audio on the Pico, I first came across this YouTube video: https://www.youtube.com/watch?v=rwPTpMuvSXg. It turns out, however, that PWM audio is discussed in https://datasheets.raspberrypi.com/rp2040/hardware-design-with-rp2040.pdf, and the creator of the above YouTube video had mostly taken the circuit from there. Basically, you need a medium-sized capacitor to remove the DC bias, some resistors and smaller caps to filter out high-frequency components, and optionally a buffer IC. It’s all right to use a digital buffer IC (I’m using a 74-series logic hex inverter), which then drives the above-mentioned resistors and caps. (The Pico can’t output a lot of current, so I decided to include the buffer, as recommended in the PDF.)

Overview

Since the MSX and its audio module and the keyboard are museum exhibits, and the museum isn’t exactly next door (fortunately not too far away though), I only had limited time to experiment with the original hardware. So what do you do in such a case? Well, I think we all agree that any sane person would immediately head to the internets and check if anyone’s ever implemented the YM2151 (the FM synthesis chip) on an FPGA. (Well, any sane person who owns an unused FPGA. Mine is an UPduino that I bought a couple years ago. They’re actually more expensive now than back then.) As a bonus, if it turns out that the DAC is fine, we should (sometime in the future) be able to hook up our FPGA to the SFG-01 and see if it produces the same weird distorted sound. If it doesn’t, we can be reasonably sure that the YM2151 on the SFG-01 is the one causing the weird sound. (Assuming there are no bad solder joints, etc.)

It turns out that the YM2151 does indeed exist in the form of Verilog code: https://github.com/jotego/jt51. Amazing! Thank you very much. Impressive. 😳 So all we have to do is:

  1. Put this on our FPGA
  2. Find a way to control the FPGA
  3. Connect the FPGA’s output to our DAC and experiment until it sounds okay

On 1: unfortunately our FPGA is a little bit too small to fit the entire thing. Also, the inputs and outputs are slightly different from the original chip! What do we do? See this post: Lowering the footprint of JT51 (YM2151 Verilog clone) to work on smaller FPGAs, specifically the ICE40UP5K (Part 1? WIP? Progress diary?) / UPduino mini-tutorial

On 2: I took this: https://github.com/iComputer7/RaspiPicoVGM.git (nice work, thank you very much!) and modified it to only support the YM2151, removed SD card support, and made it read the VGM data from a header file instead. My modified code is at https://github.com/qiqitori/RaspiPicoVGM.

On 3: that’s this post, I guess.

Debugging methodology

There were many hours spent debugging this. How do you even debug audio that sounds wrong somehow? Well, as with all debugging, you break things up into smaller things that you can actually verify to be correct (or prove incorrect):

  1. Make sure the digital data you are receiving on the Pico is the same as what the FPGA is supposed to be putting on the wire.
    1. Make the FPGA always output the same dummy value. Result: the received data is not always the same; the most significant bit is sometimes flipped.
    2. Check if the Pico’s pio_sm_is_rx_fifo_empty() function is lying or something. Yes, looks like it.
    3. Implement a workaround. (More on that later in this post.)
  2. Audio sounds slightly better but overall still crappy.
    1. Forget about the mantissa + exponent algorithms for a second and make the FPGA output straight 16-bit signed PCM.
    2. There’s a hiss but generally speaking it sounds pretty good!
    3. Play around with the PWM audio parameters
    4. Oh wow, the hiss is gone and things sound almost perfect.
  3. Raw PCM audio sounds good, but mantissa + exponent audio still doesn’t.
    1. Make the FPGA output PCM for one sample, and mantissa + exponent of the exact same sample on the next sample.
    2. Put a hexdump in a spreadsheet and see if we can spot the problem. The mantissa + exponent samples should be exactly the same (but with some of the lower bits all 0s), but often they’re somewhat different.
    3. Fix some issues that we introduced in the FPGA code
      1. Output changes continuously and must be latched on the first clock cycle of a new sample
      2. reg/wire confusion
    4. Pico DAC’s mantissa + exponent code was slightly wrong too

The thing mentioned in 1-2 could be a bug in the Pico SDK (or documentation). I’ll probably look into that at some point. The workaround consists of reading from the FIFO twice.

Here’s a screenshot of the aforementioned spreadsheet:

The 2d layout, the conditional formatting, VLOOKUP, string processing functions all make it pretty easy to figure stuff out, in my opinion. YMMV. It would have been helpful if LibreOffice’s HEX2BIN could support more than 8 bits, but 8 bits should be enough for anybody, right?
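If you'd rather check dumped values in a script than in a spreadsheet, here is a rough Python port of the mantissa + exponent arithmetic in ym3012_dac_get_sample() from the code section below. It takes the 13-bit value (what the C code calls data_and_exp) and mirrors the two branches as written there. The function names are made up, and treating the JT51 exponent as not negated follows the "not needed on JT51" comment.

```python
def split_word(word13, negate_exp=True):
    """Split a 13-bit word into (10-bit mantissa, 3-bit exponent)."""
    exp = (word13 >> 10) & 0b111
    if negate_exp:
        exp = ~exp & 0b111
    return word13 & 0x3FF, exp


def decode_ym2151(word13):
    """Mirror of the non-JT51 branch: exponent negated, unsigned result."""
    data, exp = split_word(word13, negate_exp=True)
    if exp == 0:
        return 0
    return (data << 6) // (2 << (exp - 1))


def decode_jt51(word13):
    """Mirror of the JT51 branch: sign-extend, then offset to unsigned 16-bit."""
    data, exp = split_word(word13, negate_exp=False)
    if exp == 0:
        return 0
    result = (data << (exp - 1)) & 0xFFFF
    if data & (1 << 9):                       # mantissa sign bit set
        result |= (0xFFFF << (exp + 9)) & 0xFFFF
    return (result + 32768) & 0xFFFF


print(format(0x1555, "016b"))          # 0001010101010101
print(split_word(0x1555))              # (341, 2)
print(decode_ym2151(0x1555))           # 5456
print(decode_jt51((1 << 10) | 0x3FF))  # 32767
```

0x1555 is the 5461 that shows up in the debug output quoted in the C code. The last line feeds in a mantissa of all ones (that is, -1) with exponent 1, which ends up just below the 32768 midpoint.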

I also used a tiny script (that I’m including below, just for my own convenience for when I need to get back to something related) to convert a hex dump into audio, using xxd and sox:

#!/bin/bash

# assumes a log file generated e.g. like this: minicom -C sample_dump1.log -D /dev/ttyACM0

tail -n +2 "$1" > "$1.trunc" # get rid of hello world debug output
xxd -p -r "$1.trunc" > "$1.trunc.raw"
sox -c 1 -r 62000 -t u16 "$1.trunc.raw" -b 16 -e signed-integer "$1.trunc.wav"

Pic/audio/video

JT51 running on the UPduino, RaspiPicoVGM running on a Pico (top right) pico_ym3012 running on a Pico (top left)

I obtained a VGM for the YM2151 from this page: https://vgmrips.net/packs/pack/fantasy-zone-ii-dx-sega-system-16c. I chose “10 Years After ~ Cama-Ternya [Demo]”, and converted this from VGM to a header file for use with RaspiPicoVGM using xxd -i. Below is some audio of this VGM being played back using the above pictured setup. Note that it isn’t perfect, most likely due to some issues on the FPGA side:

Played on JT51 controlled by RaspiPicoVGM, DAC’d by ym3012_dac

(Here’s a YouTube video of how this song is actually supposed to sound: https://www.youtube.com/watch?v=5sBDx56lv7g)

The below video shows the pico_ym3012 connected to the SFG-01 using tiny test clips, fully reproducing the growling/distorted sound that is the source of this whole investigation.

Verilog lessons learned

  • If you have a `define in one file and an `ifdef in another file, that `ifdef could very well evaluate as true.
  • Latching is pretty important
  • Executing always blocks on the correct conditions is pretty important
  • The synthesis tool won’t always catch wire vs. reg mistakes
  • Verilator will catch some things that yosys will just interpret in the probably correct way

The code

The code is also available at https://github.com/qiqitori/pico_ym3012/. License is GPLv3 for ten years after release. If there is no update saying something to the contrary, consider it public domain. I have only reproduced the major bits below.

ym3012_dac.c:

#include <stdio.h>

#include "pico/stdlib.h"
#include "pico/multicore.h"
#include "hardware/pio.h"
#include "hardware/uart.h"
#include "hardware/pwm.h"
#include "ym3012_dac.pio.h"
#include "hardware/irq.h"  // interrupts

#define PIN_BASE 0
#define AUDIO_PIN 28

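// #define DEBUG 1
// #define JT51 1 // note: ym3012_dac.pio.h is already included above, so the #ifdef JT51 inside that header does not see this define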

#ifdef JT51
#define DESIRED_SAMPLE_RATE 62000 // 4 MHz VGM
#else
#define DESIRED_SAMPLE_RATE 57000 // 315/88 MHz / 2 / 32
#endif

uint16_t samples[110000] = { 0 };
uint16_t last_sample;

int main() {
#ifdef DEBUG
    stdio_init_all();
    sleep_ms(5000);
    printf("Hello world\n");
#endif

    // Init PWM for audio out
    gpio_set_function(AUDIO_PIN, GPIO_FUNC_PWM);
    int audio_pin_slice = pwm_gpio_to_slice_num(AUDIO_PIN);

    // Setup PWM for audio output
    // We run at around 125 MHz. If we set the pwm counter's top value (== wrap value) to 8192 (generally, bigger is better), the pwm counter can reach the top value 15258.7890625 times per second, which would be our effective sample rate. (Calculation: 125000000/8192)
    // However, our target sample rate is larger than that. Let's say if we wanted 44100 Hz: 125000000/44100 = 2834.46712018, so that's the max top value we should set.
    // However, our target sample rate is even larger than that. Let's say we want 60 KHz. Then the max top value is 2083.33333333.
    // In that case, our samples' max loudness should be about half that, 1041.66666667.
    // That's pretty close to 1024. That's good.
    // Let's not hard-code this but calculate based on the desired sample rate.
    // Note that the desired sample rate depends on the VGM tune played.
    uint16_t pwm_wrap = clock_get_hz(clk_sys)/DESIRED_SAMPLE_RATE-24; // TODO: Check if -24 actually improves anything (original intent is to buy microcontroller some time to move to the next sample -- if we don't have enough time, pwm_set_gpio_level might not make it in time and the entire next PWM cycle would be played using the level of the previous sample. I think so anyway.)
    pwm_config config = pwm_get_default_config();
    pwm_config_set_clkdiv(&config, 1.0f);
    pwm_config_set_wrap(&config, pwm_wrap);
    pwm_set_gpio_level(AUDIO_PIN, 0);
//     pwm_set_phase_correct(audio_pin_slice, true); // TODO: maybe test if this changes anything?
    pwm_init(audio_pin_slice, &config, true);

    // Init state machine for PIO
    PIO pio = pio0;
    uint sm = 0;
    uint offset = pio_add_program(pio, &ym3012_dac_program);
    ym3012_dac_init(pio, sm, offset, PIN_BASE);

#ifdef DEBUG
    for (int j = 0; j < 15; j++) {
        for (int i = 0; i < 110000; i++) {
            samples[i] = ym3012_dac_get_sample(pio, sm);
        }
        for (int i = 0; i < 110000; i+=8) {
            printf("%04x %04x %04x %04x %04x %04x %04x %04x\n", samples[i], samples[i+1], samples[i+2], samples[i+3], samples[i+4], samples[i+5], samples[i+6], samples[i+7]);
        }
    }
#else
    while (true) {
        last_sample = ym3012_dac_get_sample(pio, sm); // same as above
//         printf("%04x\n", last_sample);
        last_sample = last_sample >> 5;

        pwm_set_gpio_level(AUDIO_PIN, last_sample);
    }
#endif
}
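The PWM arithmetic from the comments in main() can be replayed offline. Here is a small sketch that assumes a system clock of exactly 125 MHz (the comments say "around 125 MHz") and a clock divider of 1, as configured above. The function names are made up.

```python
CLK_SYS_HZ = 125_000_000  # assumed system clock


def pwm_wrap(sample_rate_hz, margin=24, clk_hz=CLK_SYS_HZ):
    """Same integer arithmetic as the pwm_wrap line in main()."""
    return clk_hz // sample_rate_hz - margin


def pwm_rate_hz(wrap, clk_hz=CLK_SYS_HZ):
    """PWM periods per second for a counter that counts from 0 to wrap."""
    return clk_hz / (wrap + 1)


MAX_LEVEL = 0xFFFF >> 5  # largest value main() can pass to pwm_set_gpio_level

for rate in (57000, 62000):
    wrap = pwm_wrap(rate)
    print(rate, wrap, round(pwm_rate_hz(wrap)), MAX_LEVEL <= wrap)
```

Under these assumptions the wrap value comes out as 2168 for 57000 Hz and 1992 for 62000 Hz, while the largest possible level is 2047. So with the JT51 setting, the very loudest samples would presumably sit at 100% duty cycle. Whether that is audible is a different question.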

ym3012_dac.pio:

.program ym3012_dac

; // WARNING you need to switch between JT51/YM2151/PCM code yourself by commenting/uncommenting the relevant PIO code blocks below!

; for man+exp (YM2151):
    set x, 12            ; Preload bit counter, delay until eye of first data bit
    wait 1 pin 1        ; Wait for SAM HIGH // WARNING WARNING WARNING WARNING WARNING WARNING WARNING WARNING change required on JT51: wait 0 pin 1
    wait 0 pin 1        ; Wait for SAM LOW // WARNING WARNING WARNING WARNING WARNING WARNING WARNING WARNING change required on JT51: wait 1 pin 1
    ; ignore first three bits, as specified in data sheet
    wait 1 pin 2        ; Wait for clock HIGH
    wait 0 pin 2        ; Wait for clock LOW
    wait 1 pin 2        ; Wait for clock HIGH
    wait 0 pin 2        ; Wait for clock LOW
    wait 1 pin 2        ; Wait for clock HIGH
bitloop: ; Loop x+1 times (13 bits)
    wait 0 pin 2        ; Wait for clock LOW
    wait 1 pin 2        ; Wait for clock HIGH
    in pins, 1          ; Sample data
    jmp x-- bitloop     ;

; for JT51 linear signed 16-bit PCM:
; for linear s16:
;    set x, 15            ; Preload bit counter
;    wait 0 pin 1        ; Wait for SAM LOW
;    wait 1 pin 1        ; Wait for SAM HIGH
;bitloop: ; Execute following code x+1 times
;    wait 1 pin 2        ; Wait for clock HIGH
;    in pins, 1          ; Sample data
;    wait 0 pin 2        ; Wait for clock LOW
;    jmp x-- bitloop     ;

% c-sdk {
#include "hardware/clocks.h"
#include "hardware/gpio.h"

// #define YM3012_CLK 2000000 // for 4 MHz tunes
#define YM3012_CLK 1790000 // SFG-01 runs at NTSC speed
#define CLK_MULTIPLIER 8 // we need to run faster because we do "wait 1"/"wait 0"s for every transition in PIO code (and have some other extra instructions too)
#define NEGATE_EXP 1
// #define LINEAR_PCM_S16_INPUT 1
// #define DEBUG 1

static inline void ym3012_dac_init(PIO pio, uint sm, uint offset, uint pin_base) {
    pio_sm_set_consecutive_pindirs(pio, sm, pin_base, 3, false);
    pio_gpio_init(pio, pin_base);

    pio_sm_config c = ym3012_dac_program_get_default_config(offset);
    sm_config_set_in_pins(&c, pin_base);
    // Shift existing values to the right when new value comes in
    // The YM3012 receives D0 first, which is the least significant bit
#if LINEAR_PCM_S16_INPUT
    sm_config_set_in_shift(&c, true, true, 16); // signed 16-bit linear, shift to right
#else
    sm_config_set_in_shift(&c, true, true, 13); // man+exp, 10+3 bits, shift to right
#endif
    sm_config_set_fifo_join(&c, PIO_FIFO_JOIN_RX); // appears to be necessary??
    float div = (float)clock_get_hz(clk_sys) / (YM3012_CLK*CLK_MULTIPLIER); // TODO: 4 * actual clock rate would be nice // "For example, the YM2151 internally divides the clock by 2, and has 32 operators to iterate through. Thus, for a nominal input clock of 3.58MHz, you end up at around a 55.9kHz sample rate." https://github.com/aaronsgiles/ymfm/blob/main/README.md
    sm_config_set_clkdiv(&c, div);

    pio_sm_init(pio, sm, offset, &c);
    pio_sm_set_enabled(pio, sm, true);
}

static inline uint16_t ym3012_dac_get_sample(PIO pio, uint sm) {
    // Read one sample from the FIFO: 13 bits (man+exp) or 16 bits (linear PCM); data is left-justified in the 32-bit word
    uint16_t data_and_exp, data, result, leading_ones;
    uint8_t exp;
    io_rw_32 *rxfifo_shift = (io_rw_32*)&(pio->rxf[sm]);
    while (pio_sm_is_rx_fifo_empty(pio, sm))
        tight_loop_contents();
    uint16_t rxfifo_contents = *rxfifo_shift; // HACK. If we don't read this twice we may get a stale?? value with the last bit sometimes missing. (HOWEVER reading thrice we get something stale again. Though maybe we're just a little late when reading the third time?) (see example below)
#ifdef LINEAR_PCM_S16_INPUT
#ifdef DEBUG
    return (uint16_t)((int16_t)(*rxfifo_shift >> 16)); // don't want that ugly offset when we're debugging
#else
    return (uint16_t)((int16_t)(*rxfifo_shift >> 16)+32768);
#endif // DEBUG
#else // !LINEAR_PCM_S16_INPUT:

    data_and_exp = (uint16_t)(*rxfifo_shift >> 19);

#ifdef NEGATE_EXP // not needed on JT51
    exp = ~((data_and_exp) >> 10) & 0b111; // top 3 bits, negated
#else
    exp = ((data_and_exp >> 10) & 0b111); // top 3 bits
#endif

    data = data_and_exp & 0b1111111111; // lower 10 bits
    if (exp == 0) { // probably doesn't happen on the JT51 at least, and shouldn't happen on YM2151 according to datasheet
        result = 0; // according to jt51_exp2lin.v
    } else {
#ifdef JT51
        result = (data << (exp-1));
        // For signed numbers (first bit of mantissa is 1) we need to sign extend by adding a bunch of ones.
        // The number of ones to be added is: 16 (because uint16_t) - (left_shift_amount (== exp-1) + 10 (mantissa length)).
        // We can create a value with the specified number of leading ones by left shifting a value that is all ones.
        // We need to shift by (16-number_of_desired_leading_ones) (e.g., 0xffff with 16 leading ones can only be achieved by left shifting by 0).
        // 16 - (16-((exp-1)+10)) = 16 - (16 - (exp-1) - 10) = 0 - -(exp-1) - -10 = (exp-1) + 10 = exp + 9
        leading_ones = 0xffff << ((exp-1) + 10);
        if (data & (1<<9)) // test for first bit of mantissa
            result |= leading_ones; // add leading ones
        result = (int16_t)result + 32768;
#else
        result = data << 6;
        result = result / (2<<(exp-1));
#endif
    }

    // related to above HACK:
    // example output of below printf demonstrating the stale output when reading the first and third times
    // first read: 0
    // third read: 715653120 or 2863136768
    // second read (>> 19): always 5461
//     0 715653120 5461 341 2 170
//     0 2863136768 5461 341 2 170
//     0 2863136768 5461 341 2 170
//     0 715653120 5461 341 2 170
//     0 2863136768 5461 341 2 170
//     0 2863136768 5461 341 2 170
//     0 715653120 5461 341 2 170
//     0 715653120 5461 341 2 170
//     0 715653120 5461 341 2 170
//     0 2863136768 5461 341 2 170
//     printf("%u %u %u %u %u %u\n", rxfifo_contents, *rxfifo_shift, data_and_exp, data, exp, result);

    return result;
#endif // LINEAR_PCM_S16_INPUT
}
%}
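The clock numbers that appear in the comments and defines above can be double-checked with a few lines of arithmetic. This sketch assumes a 125 MHz system clock and just replays the formulas from the code and the quoted ymfm README:

```python
from fractions import Fraction

# Sample rate: the 315/88 MHz (about 3.58 MHz) clock, divided by 2, then by 32.
NTSC_CLOCK_HZ = Fraction(315_000_000, 88)
sample_rate_hz = NTSC_CLOCK_HZ / 2 / 32

# PIO clock divider, same formula as in ym3012_dac_init().
CLK_SYS_HZ = 125_000_000  # assumed system clock
YM3012_CLK = 1_790_000
CLK_MULTIPLIER = 8
pio_div = CLK_SYS_HZ / (YM3012_CLK * CLK_MULTIPLIER)

print(f"{float(sample_rate_hz):.1f} Hz, divider {pio_div:.3f}")  # 55930.4 Hz, divider 8.729
```

So the state machine runs at roughly 14.3 MHz, eight PIO cycles per YM3012 clock cycle, and the chip's sample rate is a bit below the 57000 used for DESIRED_SAMPLE_RATE.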

The scaffolding is basically the same as usual. See the GitHub repository for details.