AI day in retroland. Prolog on the Commodore VIC-20! (Needs expanded RAM)

Today’s post is AI-heavy! AI as in OCR (“optical character recognition”). We will OCR (“optical character recognize”) a hex listing for a Prolog interpreter (which used to be thought of as an “AI language”) for the Commodore VIC-20! (As a bonus, some small parts of the tools I made to verify the OCR transcription were written by ChatGPT.)

As you may have heard before, OCRing stuff is error-prone. Ls and Is and 1s being mixed up makes natural language texts annoying to read, and program listings almost useless, because you’ll spend a long time trying to find the error. Why does this take a long time? Because our eyes (and attached circuitry) don’t notice tiny imperfections in a sea of details. However, we are quite good at noticing things that look completely different from their surroundings.

With hex OCR, we really only have to worry about 16 different classes (types of digit). This makes it relatively easy to verify if our OCR is correct (and perform fixes), because we can take our OCR’d digits and temporarily (while remembering their original position) display them all, sorted by class. Like this:

We can easily see that all these are indeed 3s, 4s, and 5s.

Or like this:

We can easily see that there’s a 0 in our list of 6s and a 9 in our list of 7s.

(Note: occasionally, OCR tools will turn a single character into two characters, or the other way round. That kind of problem will require manual edits.)
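The "sort by class while remembering the original position" idea can be sketched in a few lines of Python. This is just an illustration of the concept, not the code of the actual verification tool; the function name and the sample string are made up for the example.

```python
from collections import defaultdict

def group_by_class(labels):
    """Map each hex digit class to the positions where the OCR predicted it."""
    groups = defaultdict(list)
    for position, label in enumerate(labels):
        groups[label.upper()].append(position)
    return dict(groups)

# Illustrative input: a handful of OCR'd hex digits.
ocr_output = "78A9008D"
groups = group_by_class(ocr_output)
for digit in sorted(groups):
    # Each class lists the original positions, so a fix can be written back later.
    print(digit, groups[digit])
```

In the real tool you look at the cropped images instead of position numbers, of course, but the bookkeeping is the same.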

I created two tools, a segmentation tool, and the above verification tool, and described them a little more in this post: OCRing hex dumps (or other monospace text) and verifying the result. The tools themselves are at https://blog.qiqitori.com/ocr/monospace_segmentation_tool/ and https://blog.qiqitori.com/ocr/verification_tool/.

This is the scan in question: https://archive.org/details/Io19833/page/n342/mode/1up. Here’s the rather good-looking first page:

I/O (アイ・オー), March 1983 issue

For the original OCR, I used a program called ProgramListOCR. The program supports OCRing hex dumps. This program requires that you touch up input images in (e.g.) Gimp before loading them. It’s not difficult, and the program’s README describes what needs to be done. Unfortunately, this process removes a small amount of detail from the image, making it harder to distinguish between, e.g., Bs and 8s. And unfortunately, I believe the program only runs in Windows. Here’s a screenshot of the program running:

ProgramListOCR made 142 digit mistakes. The hex dump consisted of 7310 digits, so the overall error rate is 1.943%, or the accuracy is 98.057%.
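For the record, here is the arithmetic behind those percentages, using only the two numbers given above:

```python
mistakes = 142
total_digits = 7310

error_rate = mistakes / total_digits * 100  # in percent
accuracy = 100 - error_rate

print(f"error rate: {error_rate:.3f}%")
print(f"accuracy:   {accuracy:.3f}%")
```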

How to download and run Prolog

In order to run this on your VIC-20 emulator, you need to set it to have an 8K memory expansion. Then you need to load the binary data into RAM; the starting address is $2204 (hexadecimal). In VICE, you can add the memory expansion in this config window:

Select at least “Block 1 (8KiB at $2000-$3FFF)”. PAL/NTSC etc. do not matter.

To load the binary data into address $2204 and beyond, start the monitor (Alt+H), and then I wish it’d work with ‘load “/path/to/prolog.bin” 0 2204’. But for some reason that doesn’t work; the first few bytes are garbled and the rest isn’t aligned correctly. If you have this issue, try the other file and ‘load “/path/to/prolog_prefixed_with_zeros.bin” 0 2202’. Execute “m 2200” in the monitor to see if VICE loaded your file into the correct address. The following is an example of a successful load:

2200-2203 don’t matter, 2204- should be 78 a9 00 8d, etc.
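If you want to sanity-check a binary before fighting with the monitor, something like the following sketch works. The expected bytes are the ones quoted above; the function name is made up, and the address conversions are only there because the monitor talks hex while BASIC's SYS takes decimal.

```python
LOAD_ADDRESS = 0x2204                       # where the first byte has to end up
EXPECTED_START = bytes.fromhex("78a9008d")  # first four bytes, as quoted above

def looks_like_prolog(data: bytes) -> bool:
    """True if the data begins with the expected first four bytes."""
    return data[:len(EXPECTED_START)] == EXPECTED_START

# Usage idea (the path is a placeholder):
# with open("/path/to/prolog.bin", "rb") as f:
#     print(looks_like_prolog(f.read()))

print(LOAD_ADDRESS)   # load address in decimal
print(hex(11445))     # the decimal address given to SYS, in hex
```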

Then you close the monitor and type “SYS 11445” in the BASIC prompt, and you should get something like this:

Having fun with Prolog

There are various sample programs in the magazine. Note that the Prolog interpreter sometimes gives you a question mark prompt, and sometimes a hyphen prompt. You have to delete these manually by pressing backspace (Delete), depending on what you want to do! Let’s start with this short program:

m(*a.*x)*a*x
m(*b.*x)*a(*b.*y)
-m*x*a*y
p()()
p*x(*a.*y)
-m*x*a*z
-p*z*y
-;
?p(1 2 3)*x
-pr*x
-m;
***answer***
(1 2 3)
(1 3 2)
(2 1 3)
(2 3 1)
(3 1 2)
(3 2 1)
!!fail!!
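In case the listing is hard to read: as far as I can tell, m picks one element out of a list (and hands back the rest), and p builds permutations by repeatedly calling m. Here is the same idea as a Python sketch. This is my reading of the listing, not something taken from the magazine, and the function names are made up.

```python
def select(items):
    """Yield (element, rest) pairs, like my reading of the m predicate."""
    for i, element in enumerate(items):
        yield element, items[:i] + items[i + 1:]

def permutations(items):
    """Yield all orderings of items, like my reading of the p predicate."""
    if not items:
        yield []  # corresponds to p()()
        return
    for element, rest in select(items):
        for tail in permutations(rest):
            yield [element] + tail

for result in permutations([1, 2, 3]):
    print(result)
```

The results come out in the same order as in the Prolog session above.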

The next program (actually the first in the magazine, and easiest) is a program that tells you whether the density of blocks 1-4 is high or low, or unknown:

That’s the data and the functions, er, I mean predicates.
weight block1 heavy
weight block2 heavy
weight block3 light
weight block4 light
bulk block1 large
bulk block3 large
bulk block2 small
bulk block4 small
density *x high
-weight *x heavy
-bulk *x small
density *x low
-weight *x light
-bulk *x large
density *x ???
-weight *x heavy
-bulk *x large
density *x ???
-weight *x light
-bulk *x small
-;
?

I believe I speak for us all when I say, the syntax looks a bit weird? Anyway, the first few things are the data, er, I mean facts. Then you get a function, er, predicate “signature”, and below the predicate signature you get the actual… predicate definition (the lines that start with a hyphen). (Predicates may also be called rules.) Want to finish up the current rules and start a new one with a different signature? Just backspace away the hyphen. When you’re all done, type a semicolon, and you’ll be back at the ‘?’ prompt. Now we can run queries!

In the screenshot, we first ask which blocks have a high density. The answer is BLOCK2!

Then we ask it for the density of BLOCK3 and ask it for the reason, using PROOF:

?DENSITY BLOCK3 *X
-PROOF;

And the answer is:

/DENSITY BLOCK3 LOW ,BECAUSE*** WEIGHT BLOCK3 LIGHT ,& BULK BLOCK3 LARGE:
/Q.E.D.
DENSITY BLOCK3 LOW...IS TRUE:
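If Prolog isn't your thing, here is roughly what the facts and rules above say, as a Python sketch. The data is copied from the listing; the structure and names are just my way of spelling it out, and of course this has none of Prolog's actual search machinery (or PROOF).

```python
WEIGHT = {"block1": "heavy", "block2": "heavy", "block3": "light", "block4": "light"}
BULK = {"block1": "large", "block3": "large", "block2": "small", "block4": "small"}

# (density, required weight, required bulk), in the order of the listing
RULES = [
    ("high", "heavy", "small"),
    ("low", "light", "large"),
    ("???", "heavy", "large"),
    ("???", "light", "small"),
]

def density(block):
    """Return the density of a block according to the first matching rule."""
    for result, weight, bulk in RULES:
        if WEIGHT[block] == weight and BULK[block] == bulk:
            return result
    return None

def blocks_with_density(wanted):
    """Return all blocks whose density matches, like the first query."""
    return [block for block in WEIGHT if density(block) == wanted]

print(blocks_with_density("high"))
print(density("block3"))
```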

OCRing hex dumps (or other monospace text) and verifying the result

Summary: Segmentation tool and OCR verification tool. You can use these tools to either verify an existing OCR’d hex dump, or use them to run your own OCR. (Which isn’t hard! You can probably get ChatGPT to produce a probably working Python script using PyTorch to learn the digits, and easily get 97% (or so) accuracy. Maybe something along the lines of, “Write a Python script that uses PyTorch to train recognition of something like MNIST, except there are 16 classes, not 10. The recognition should use convolutional layers. Input images are PNG files. Labels are in a text file.” (I just tried and the result looks plausible.))

Why hex dumps anyway? Because in the 1980s computer magazines sometimes included printed hex dumps of programs. But that’s just how I got motivated to write these tools. More on that in this post.

If you are familiar with basic image recognition concepts, you may know that detecting hand-written digits is generally considered to be a very easy task, the “hello world” of AI image recognition even. (Didn’t know this? Maybe search for “MNIST dataset”)

If recognizing handwritten digits is considered so easy, recognizing printed digits should be even easier, no? The answer is “yes” and “no”, because I left out some information above. The MNIST dataset consists of images that contain exactly one digit. OCR, on the other hand, requires segmentation. In general, recognizing typed letters if you have them in a nicely cropped single image is quite easy. (Except for letters that look very similar or even identical, of course.) Is segmentation an easy task? Well, there are all kinds of layouts out there. If you want to know more about segmentation, Andrew Ng explained the basics in this and the following few videos: https://www.youtube.com/watch?v=CykIW9hFK24&list=PLLssT5z_DsK-h9vYZkQkYNWcItqhlRJLN&index=108. These videos are part of Andrew Ng’s Machine Learning course on coursera.org, but I can’t find the specific lecture that contained this bit. (tl;dr: basically, you have a pipeline with multiple stages: first you detect regions that vaguely look like text, then a stage that detects if you have a single character or more than one character, and finally a stage that can recognize single characters.)

Performing segmentation on hex dumps and other monospace text is quite easy. However, getting the segmentation wrong can ruin the OCR. Either hardly anything will be recognized, or things will be jumbled up. I played around with Tesseract and a couple of other OCR systems but wasn’t able to get good results on hex dumps. Hex dumps have the additional benefit that there are only 16 symbols that need to be recognized. One tool that worked pretty well was ProgramListOCR (https://github.com/eighttails/ProgramListOCR). I think it was over 95% accurate with my input images. If it could output the segmentation too, it would be even better, in my opinion.

In this blog post I’m going to describe the tools I linked to above (Segmentation tool and OCR verification tool) and how we can use these tools to get a perfect OCR scan of a hex dump. Because let’s face it… A 99% correct hex dump isn’t all that useful, unless you enjoy sending old CPUs off the rails, or playing spot the difference.

Text segmentation/image tiling tool

The segmentation tool sort of looks like this (at the time of this writing):

Top of page (let’s skip the middle)
Bottom

What you can do here is: select an image from a file, specify the number of columns and rows, adjust the rows and columns using the buttons on the right (and then clicking somewhere in the image to make that row or column smaller or larger), export to tiles. The adjustment process is best done at high zoom levels (use Ctrl+scroll wheel to zoom). You can also choose to skip certain columns when exporting. You can use the keyboard to do most things. (Cursor keys: move around, space: use current tool at cursor position, T: toggle column export tool, x/X: add/remove X offset tool, y/Y: add/remove Y offset tool.) The tiles will be output into data URLs in the text area at the bottom of the page. You can convert the data URLs back into files using the given shell code snippet. Also reproduced here:

# put contents of clipboard into a file:
xclip -o > data.txt

# convert data URLs in data.txt to PNG files:
i=0; while read -r line; do output_file=$(printf "%05d.png" $i); echo "${line:22}" | base64 -d > "$output_file"; i=$((i+1)); done < data.txt
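If you don't have a shell with xclip and base64 handy, the same conversion could look something like this in Python. It's a sketch that assumes one base64 data URL per line, like the shell snippet does; the function names are made up. Instead of cutting off a fixed 22 characters ("data:image/png;base64,"), it splits at the first comma.

```python
import base64

def decode_data_url(data_url: str) -> bytes:
    """Return the raw bytes contained in a base64 data URL."""
    header, _, payload = data_url.strip().partition(",")
    if not header.startswith("data:") or not header.endswith(";base64"):
        raise ValueError("not a base64 data URL")
    return base64.b64decode(payload)

def write_tiles(lines):
    """Write one PNG per non-empty line, named 00000.png, 00001.png, ..."""
    names = []
    for i, line in enumerate(l for l in lines if l.strip()):
        name = f"{i:05d}.png"
        with open(name, "wb") as f:
            f.write(decode_data_url(line))
        names.append(name)
    return names

# Usage idea: with open("data.txt") as f: write_tiles(f)
```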

You can of course also tile into single characters. You’ll just have to fiddle a little more with the offset tools.
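To give an idea of what "specify the number of columns and rows" boils down to, here is a sketch that computes evenly spaced tile rectangles. This is not the tool's actual code, and it ignores the per-row and per-column offsets that the tool lets you add; the function name is made up.

```python
def tile_boxes(width, height, columns, rows):
    """Return (left, top, right, bottom) boxes for an evenly spaced grid."""
    boxes = []
    for row in range(rows):
        top = round(row * height / rows)
        bottom = round((row + 1) * height / rows)
        for col in range(columns):
            left = round(col * width / columns)
            right = round((col + 1) * width / columns)
            boxes.append((left, top, right, bottom))
    return boxes

# Example: a 100x40 pixel image tiled into 4 columns and 2 rows.
for box in tile_boxes(100, 40, 4, 2):
    print(box)
```

Real scans are rarely this regular, which is why the offset tools exist.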

OCR verification tool

Here are two screenshots that may help to get some intuition on how to use this tool:

Top of page
Bottom

This tool (at the time of this writing) expects as input 1) images as data URLs, to be pasted into the textarea at the top of the page, and 2) the predicted labels corresponding to each image.

The tool then displays the input images sorted into their classes for easy verification by a human. ;) And this is pretty easy for humans, because there are just 16 classes and the human eye is very sensitive to objects that don’t look like the surrounding objects. Here’s another screenshot to demonstrate that it should be easy to find things that look out of place:

The 0 surrounded by 6s and the 9 in the cluster of 7s should be pretty easy to spot. (It is also possible that there are Bs surrounded by 8s, but that’s a different topic.)

The web page allows you to drag the images around to put them in the correct category, and to then reconstruct the labels, taking the fixes into account.
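Reconstructing the labels is the easy part: each image remembers its original position, so after the fixes you just write each class's digit back to the positions it contains. A sketch of that step (again, an illustration with made-up names, not the tool's code):

```python
def reconstruct_labels(groups, length):
    """Rebuild the label string from {digit: [positions]} after manual fixes."""
    labels = [None] * length
    for digit, positions in groups.items():
        for position in positions:
            if labels[position] is not None:
                raise ValueError(f"position {position} is in two classes")
            labels[position] = digit
    if None in labels:
        raise ValueError("some positions have no class")
    return "".join(labels)

# Example: position 2 was predicted as a 6 and has been dragged over to the 0s.
fixed_groups = {"6": [0, 3], "0": [2], "7": [1]}
print(reconstruct_labels(fixed_groups, 4))
```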

Dragging and dropping 9s mis-identified as 7s.

Here’s a more real-world example, with unpolished input images. (If you invest a couple minutes to add/remove offsets in the segmentation tool, you should get slightly better images than this.)

This is one page (around 1/3) of the entire hex dump

Raspberry Pi Pico implementation of the YM3012 DAC (mono)

Introduction

The YM3012 IC is a DAC that requires two external op amp circuits and turns a serial digital audio signal consisting of a 10-bit mantissa and 3-bit exponent into an analog signal.

I am currently investigating a fault in an audio module (SFG-01) for certain MSX computers (mostly Yamaha). This audio module is pretty capable and sports a YM2151 FM audio synthesis chip and comes with MIDI input and output ports, a connector for a digital piano keyboard, and software to use the keyboard of course. (I actually never checked if the software is in the module or in the computer.) See this for more information on the SFG-01: https://www.msx.org/wiki/Yamaha_SFG-01.

The fault becomes apparent as soon as two keys are pressed at the same time on the digital piano keyboard. You get a kind of growling/distorted effect. The audio doesn’t sound clean. (Head to the video section below to hear what it sounds like.) My first thought was, that sounds like an analog problem. Aw, I wish. I replaced a couple capacitors without any improvement whatsoever. The removed capacitors all tested fine out-of-circuit, too. A few people said it could be a problem with the op amps. One (relatively) quick way to check if that is the case, is to replace the op amps and try again. But why do it the quick and simple way (with possibly nothing to show at the end) if you can do it the slow and complicated way (with maybe something to show at the end)?

YM3012 pinout

The Raspberry Pi Pico is very good at IO. Not only do we have a lot of pins, but we can read from and write to them very, very fast. However, we aren’t going to go that fast today actually. Neither are we going to be using a lot of pins. In order to build a DAC, we need to read the CLOCK φ1, SD (DATA) and SAM1 and/or SAM2 pins. And then we need output, which in my case is a single pin outputting PWM audio. (It sounds okay, probably not exactly Hi-Fi.) My implementation only reads SAM1 and only outputs a single channel, completely discarding the other channel. It wouldn’t be too hard to get the second channel to work too — the Pico is a dual-core jobby after all, so you could just run the same code on the second core and it’d work. (As there isn’t really a lot of post-processing going on at all, you could most likely even get it to work with just a single core, but I haven’t tried.)

So, in order to test if our DAC, or one of the op amp circuits, or the filter circuits are misbehaving, we just need our Raspberry Pi Pico and check if we’re getting the faulty audio there too. If yes, the DAC is innocent. If no, the DAC or related circuitry would be implicated.

PWM audio

Researching PWM audio on the Pico, I first came across this YouTube video: https://www.youtube.com/watch?v=rwPTpMuvSXg. It turns out, however, that PWM audio is discussed in https://datasheets.raspberrypi.com/rp2040/hardware-design-with-rp2040.pdf, and the creator of the above YouTube video had mostly taken the circuit from there. Basically, you need a medium-sized capacitor to remove the DC bias, some resistors and smaller caps to filter out high-frequency components, and optionally a buffer IC. It’s all right to use a digital buffer IC (I’m using a 74-series logic hex inverter), which then drives the above-mentioned resistors and caps. (The Pico can’t output a lot of current, so I decided to include the buffer, as recommended in the PDF.)

Overview

Since the MSX and its audio module and the keyboard are museum exhibits, and the museum isn’t exactly next door (fortunately not too far away though), I only had limited time to experiment with the original hardware. So what do you do in such a case? Well, I think we all agree that any sane person would immediately head to the internets and check if anyone’s ever implemented the YM2151 (the FM synthesis chip) on an FPGA. (Well, any sane person who owns an unused FPGA. Mine is an UPduino that I bought a couple years ago. They’re actually more expensive now than back then.) As a bonus, if it turns out that the DAC is fine, we should (sometime in the future) be able to hook up our FPGA to the SFG-01 and see if it produces the same weird distorted sound. If it doesn’t, we can be reasonably sure that the YM2151 on the SFG-01 is the one causing the weird sound. (Assuming there are no bad solder joints, etc.)

It turns out that the YM2151 does indeed exist in the form of Verilog code: https://github.com/jotego/jt51. Amazing! Thank you very much. Impressive. 😳 So all we have to do is:

  1. Put this on our FPGA
  2. Find a way to control the FPGA
  3. Connect the FPGA’s output to our DAC and experiment until it sounds okay

On 1: unfortunately our FPGA is a little bit too small to fit the entire thing. Also, the inputs and outputs are slightly different from the original chip! What do we do? See this post: “Lowering the footprint of JT51 (YM2151 Verilog clone) to work on smaller FPGAs, specifically the ICE40UP5K (Part 1? WIP? Progress diary?) / UPduino mini-tutorial”

On 2: I took this: https://github.com/iComputer7/RaspiPicoVGM.git. Nice work, thank you very much! And modified it to only support the YM2151, remove SD card support, and instead read the VGM data from a header file. My modified code is at https://github.com/qiqitori/RaspiPicoVGM.

On 3: that’s this post, I guess.

Debugging methodology

There were many hours spent debugging this. How do you even debug audio that sounds wrong somehow? Well, as with all debugging, you break things up into smaller things that you can actually verify to be correct (or prove incorrect):

  1. Make sure the digital data you are receiving on the Pico is the same as what the FPGA is supposed to be putting on the wire.
    1. Make the FPGA always output the same dummy value and check that the Pico always receives exactly that value. Not the case: the most significant bit is sometimes flipped.
    2. Check if the Pico’s pio_sm_is_rx_fifo_empty() function is lying or something. Yes, looks like it.
    3. Implement a workaround. (More on that later in this post.)
  2. Audio sounds slightly better but overall still crappy.
    1. Forget about the mantissa + exponent algorithms for a second and make the FPGA output straight 16-bit signed PCM.
    2. There’s a hiss but generally speaking it sounds pretty good!
    3. Play around with the PWM audio parameters
    4. Oh wow, the hiss is gone and things sound almost perfect.
  3. Raw PCM audio sounds good, but mantissa + exponent audio still doesn’t.
    1. Make the FPGA output PCM for one sample, and mantissa + exponent of the exact same sample on the next sample.
    2. Put a hexdump in a spreadsheet and see if we can spot the problem. The mantissa + exponent samples should be exactly the same (but with some of the lower bits all 0s), but often they’re somewhat different.
    3. Fix some issues that we introduced in the FPGA code
      1. Output changes continuously and must be latched on the first clock cycle of a new sample
      2. reg/wire confusion
    4. Pico DAC’s mantissa + exponent code was slightly wrong too

The thing mentioned in 1-2 could be a bug in the Pico SDK (or documentation). I’ll probably look into that at some point. The workaround consists of reading from the FIFO twice.

Here’s a screenshot of the aforementioned spreadsheet:

The 2d layout, the conditional formatting, VLOOKUP, string processing functions all make it pretty easy to figure stuff out, in my opinion. YMMV. It would have been helpful if LibreOffice’s HEX2BIN could support more than 8 bits, but 8 bits should be enough for anybody, right?
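If the spreadsheet's HEX2BIN is too narrow for you, a few lines of Python will do the conversion at any width. A small sketch (the function name is made up):

```python
def hex_to_bits(hex_string, width=16):
    """Convert a hex string to a zero-padded binary string of the given width."""
    value = int(hex_string, 16)
    if value >= 1 << width:
        raise ValueError(f"{hex_string} does not fit into {width} bits")
    return format(value, f"0{width}b")

print(hex_to_bits("1555"))    # a 16-bit value
print(hex_to_bits("ff", 8))   # same thing, 8 bits wide
```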

I also used a tiny script (that I’m including below, just for my own convenience for when I need to get back to something related) to convert a hex dump into audio, using xxd and sox:

#!/bin/bash

# assumes a log file generated e.g. like this: minicom -C sample_dump1.log -D /dev/ttyACM0

tail -n +2 $1 > $1.trunc # get rid of hello world debug output
xxd -p -r $1.trunc > $1.trunc.raw
sox -c 1 -r 62000 -t u16 $1.trunc.raw -b 16 -e signed-integer $1.trunc.wav

Pic/audio/video

JT51 running on the UPduino, RaspiPicoVGM running on a Pico (top right) pico_ym3012 running on a Pico (top left)

I obtained a VGM for the YM2151 from this page: https://vgmrips.net/packs/pack/fantasy-zone-ii-dx-sega-system-16c. I chose “10 Years After ~ Cama-Ternya [Demo]”, and converted this from VGM to a header file for use with RaspiPicoVGM using xxd -i. Below is some audio of this VGM being played back using the above pictured setup. Note that it isn’t perfect, most likely due to some issues on the FPGA side:

Played on JT51 controlled by RaspiPicoVGM, DAC’d by ym3012_dac

(Here’s a YouTube video of how this song is actually supposed to sound: https://www.youtube.com/watch?v=5sBDx56lv7g)

The below video shows the pico_ym3012 connected to the SFG-01 using tiny test clips, fully reproducing the growling/distorted sound that is the source of this whole investigation.

Verilog lessons learned

  • If you have a `define in one file and an `ifdef in another file, that `ifdef could very well evaluate as true.
  • Latching is pretty important
  • Executing always blocks on the correct conditions is pretty important
  • The synthesis tool won’t always catch wire vs. reg mistakes
  • Verilator will catch some things that yosys will just interpret in the probably correct way

The code

The code is also available at https://github.com/qiqitori/pico_ym3012/. License is GPLv3 for ten years after release. If there is no update saying something to the contrary, consider it public domain. I have only reproduced the major bits below.

ym3012_dac.c:

#include <stdio.h>

#include "pico/stdlib.h"
#include "pico/multicore.h"
#include "hardware/pio.h"
#include "hardware/uart.h"
#include "hardware/pwm.h"
#include "ym3012_dac.pio.h"
#include "hardware/irq.h"  // interrupts

#define PIN_BASE 0
#define AUDIO_PIN 28

// #define DEBUG 1
// #define JT51 1

#ifdef JT51
#define DESIRED_SAMPLE_RATE 62000 // 4 MHz VGM
#else
#define DESIRED_SAMPLE_RATE 57000 // 315/88 MHz / 2 / 32
#endif

uint16_t samples[110000] = { 0 };
uint16_t last_sample;

int main() {
#ifdef DEBUG
    stdio_init_all();
    sleep_ms(5000);
    printf("Hello world\n");
#endif

    // Init PWM for audio out
    gpio_set_function(AUDIO_PIN, GPIO_FUNC_PWM);
    int audio_pin_slice = pwm_gpio_to_slice_num(AUDIO_PIN);

    // Setup PWM for audio output
    // We run at around 125 MHz. If we set the pwm counter's top value (== wrap value) to 8192 (generally, bigger is better), the pwm counter can reach the top value 15258.7890625 times per second, which would be our effective sample rate. (Calculation: 125000000/8192)
    // However, our target sample rate is larger than that. Let's say if we wanted 44100 Hz: 125000000/44100 = 2834.46712018, so that's the max top value we should set.
    // However, our target sample rate is even larger than that. Let's say we want 60 KHz. Then the max top value is 2083.33333333.
    // In that case, our samples' max loudness should be about half that, 1041.66666667.
    // That's pretty close to 1024. That's good.
    // Let's not hard-code this but calculate based on the desired sample rate.
    // Note that the desired sample rate depends on the VGM tune played.
    uint16_t pwm_wrap = clock_get_hz(clk_sys)/DESIRED_SAMPLE_RATE-24; // TODO: Check if -24 actually improves anything (original intent is to buy microcontroller some time to move to the next sample -- if we don't have enough time, pwm_set_gpio_level might not make it in time and the entire next PWM cycle would be played using the level of the previous sample. I think so anyway.)
    pwm_config config = pwm_get_default_config();
    pwm_config_set_clkdiv(&config, 1.0f);
    pwm_config_set_wrap(&config, pwm_wrap);
    pwm_set_gpio_level(AUDIO_PIN, 0);
//     pwm_set_phase_correct(audio_pin_slice, true); // TODO: maybe test if this changes anything?
    pwm_init(audio_pin_slice, &config, true);

    // Init state machine for PIO
    PIO pio = pio0;
    uint sm = 0;
    uint offset = pio_add_program(pio, &ym3012_dac_program);
    ym3012_dac_init(pio, sm, offset, PIN_BASE);

#ifdef DEBUG
    for (int j = 0; j < 15; j++) {
        for (int i = 0; i < 110000; i++) {
            samples[i] = ym3012_dac_get_sample(pio, sm);
        }
        for (int i = 0; i < 110000; i+=8) {
            printf("%04x %04x %04x %04x %04x %04x %04x %04x\n", samples[i], samples[i+1], samples[i+2], samples[i+3], samples[i+4], samples[i+5], samples[i+6], samples[i+7]);
        }
    }
#else
    while (true) {
        last_sample = ym3012_dac_get_sample(pio, sm); // same as above
//         printf("%04x\n", last_sample);
        last_sample = last_sample >> 5;

        pwm_set_gpio_level(AUDIO_PIN, last_sample);
    }
#endif
}
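The numbers in the PWM comments above are easy to double-check. Here's a sketch in Python, assuming a system clock of exactly 125 MHz (the comment says "around 125 MHz"; the real code asks the SDK via clock_get_hz). The function name is made up; the -24 margin and the sample rates are the ones from the code.

```python
SYS_CLOCK_HZ = 125_000_000  # assumption: exactly 125 MHz

def pwm_wrap(sample_rate, margin=24, sys_clock_hz=SYS_CLOCK_HZ):
    """Same integer arithmetic as the pwm_wrap line in the C code."""
    return sys_clock_hz // sample_rate - margin

print(SYS_CLOCK_HZ / 8192)    # PWM periods per second with a wrap value of 8192
print(SYS_CLOCK_HZ / 44100)   # largest wrap value for 44.1 kHz
print(SYS_CLOCK_HZ / 60000)   # largest wrap value for 60 kHz
print(pwm_wrap(57000))        # wrap value used for the SFG-01
print(pwm_wrap(62000))        # wrap value used with JT51 defined

# "315/88 MHz / 2 / 32" from the comment next to DESIRED_SAMPLE_RATE:
ym2151_rate = 315 / 88 * 1_000_000 / 2 / 32
print(ym2151_rate)

# Largest PWM level after the ">> 5" in the main loop:
print(0xFFFF >> 5)
```

If I'm reading this right, the largest possible level (2047) stays below the wrap value for the 57000 setting, but not for the 62000 one.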

ym3012_dac.pio:

.program ym3012_dac

; // WARNING you need to switch between JT51/YM2151/PCM code yourself by commenting/uncommenting the relevant PIO code blocks below!

; for man+exp (YM2151):
    set x, 12            ; Preload bit counter, delay until eye of first data bit
    wait 1 pin 1        ; Wait for SAM HIGH // WARNING WARNING WARNING WARNING WARNING WARNING WARNING WARNING change required on JT51: wait 0 pin 1
    wait 0 pin 1        ; Wait for SAM LOW // WARNING WARNING WARNING WARNING WARNING WARNING WARNING WARNING change required on JT51: wait 1 pin 1
    ; ignore first three bits, as specified in data sheet
    wait 1 pin 2        ; Wait for clock HIGH
    wait 0 pin 2        ; Wait for clock LOW
    wait 1 pin 2        ; Wait for clock HIGH
    wait 0 pin 2        ; Wait for clock LOW
    wait 1 pin 2        ; Wait for clock HIGH
bitloop: ; Loop x+1 times (13 bits: 10-bit mantissa + 3-bit exponent)
    wait 0 pin 2        ; Wait for clock LOW
    wait 1 pin 2        ; Wait for clock HIGH
    in pins, 1          ; Sample data
    jmp x-- bitloop     ;

; for JT51 linear signed 16-bit PCM:
; for linear s16:
;    set x, 15            ; Preload bit counter
;    wait 0 pin 1        ; Wait for SAM LOW
;    wait 1 pin 1        ; Wait for SAM HIGH
;bitloop: ; Execute following code x+1 times
;    wait 1 pin 2        ; Wait for clock HIGH
;    in pins, 1          ; Sample data
;    wait 0 pin 2        ; Wait for clock LOW
;    jmp x-- bitloop     ;

% c-sdk {
#include "hardware/clocks.h"
#include "hardware/gpio.h"

// #define YM3012_CLK 2000000 // for 4 MHz tunes
#define YM3012_CLK 1790000 // SFG-01 runs at NTSC speed
#define CLK_MULTIPLIER 8 // we need to run faster because we do "wait 1"/"wait 0"s for every transition in PIO code (and have some other extra instructions too)
#define NEGATE_EXP 1
// #define LINEAR_PCM_S16_INPUT 1
// #define DEBUG 1

static inline void ym3012_dac_init(PIO pio, uint sm, uint offset, uint pin_base) {
    pio_sm_set_consecutive_pindirs(pio, sm, pin_base, 3, false);
    pio_gpio_init(pio, pin_base);

    pio_sm_config c = ym3012_dac_program_get_default_config(offset);
    sm_config_set_in_pins(&c, pin_base);
    // Shift existing values to the right when new value comes in
    // The YM3012 receives D0 first, which is the least significant bit
#if LINEAR_PCM_S16_INPUT
    sm_config_set_in_shift(&c, true, true, 16); // signed 16-bit linear, shift to right
#else
    sm_config_set_in_shift(&c, true, true, 13); // man+exp, 10+3 bits, shift to right
#endif
    sm_config_set_fifo_join(&c, PIO_FIFO_JOIN_RX); // appears to be necessary??
    float div = (float)clock_get_hz(clk_sys) / (YM3012_CLK*CLK_MULTIPLIER); // TODO: 4 * actual clock rate would be nice // "For example, the YM2151 internally divides the clock by 2, and has 32 operators to iterate through. Thus, for a nominal input clock of 3.58MHz, you end up at around a 55.9kHz sample rate." https://github.com/aaronsgiles/ymfm/blob/main/README.md
    sm_config_set_clkdiv(&c, div);

    pio_sm_init(pio, sm, offset, &c);
    pio_sm_set_enabled(pio, sm, true);
}

static inline uint16_t ym3012_dac_get_sample(PIO pio, uint sm) {
    // 13-bit read (10-bit mantissa + 3-bit exponent; 16 bits for linear PCM) from the FIFO (data is left-justified)
    uint16_t data_and_exp, data, result, leading_ones;
    uint8_t exp;
    io_rw_32 *rxfifo_shift = (io_rw_32*)&(pio->rxf[sm]);
    while (pio_sm_is_rx_fifo_empty(pio, sm))
        tight_loop_contents();
    uint16_t rxfifo_contents = *rxfifo_shift; // HACK. If we don't read this twice we may get a stale?? value with the last bit sometimes missing. (HOWEVER reading thrice we get something stale again. Though maybe we're just a little late when reading the third time?) (see example below)
#ifdef LINEAR_PCM_S16_INPUT
#ifdef DEBUG
    return (uint16_t)((int16_t)(*rxfifo_shift >> 16)); // don't want that ugly offset when we're debugging
#else
    return (uint16_t)((int16_t)(*rxfifo_shift >> 16)+32768);
#endif // DEBUG
#else // !LINEAR_PCM_S16_INPUT:

    data_and_exp = (uint16_t)(*rxfifo_shift >> 19);

#ifdef NEGATE_EXP // not needed on JT51
    exp = ~((data_and_exp) >> 10) & 0b111; // top 3 bits, negated
#else
    exp = ((data_and_exp >> 10) & 0b111); // top 3 bits
#endif

    data = data_and_exp & 0b1111111111; // lower 10 bits
    if (exp == 0) { // probably doesn't happen on the JT51 at least, and shouldn't happen on YM2151 according to datasheet
        result = 0; // according to jt51_exp2lin.v
    } else {
#ifdef JT51
        result = (data << (exp-1));
        // For signed numbers (first bit of mantissa is 1) we need to sign extend by adding a bunch of ones.
        // The number of ones to be added is: 16 (because uint16_t) - (left_shift_amount (== exp-1) + 10 (mantissa length)).
        // We can create a value with the specified number of leading ones by left shifting a value that is all ones.
        // We need to shift by (16-number_of_desired_leading_ones) (e.g., 0xffff with 16 leading ones can only be achieved by left shifting by 0).
        // 16 - (16-((exp-1)+10)) = 16 - (16 - (exp-1) - 10) = 0 - -(exp-1) - -10 = (exp-1) + 10 = exp + 9
        leading_ones = 0xffff << ((exp-1) + 10);
        if (data & (1<<9)) // test for first bit of mantissa
            result |= leading_ones; // add leading ones
        result = (int16_t)result + 32768;
#else
        result = data << 6;
        result = result / (2<<(exp-1));
#endif
    }

    // related to above HACK:
    // example output of below printf demonstrating the stale output when reading the first and third times
    // first read: 0
    // third read: 715653120 or 2863136768
    // second read (>> 19): always 5461
//     0 715653120 5461 341 2 170
//     0 2863136768 5461 341 2 170
//     0 2863136768 5461 341 2 170
//     0 715653120 5461 341 2 170
//     0 2863136768 5461 341 2 170
//     0 2863136768 5461 341 2 170
//     0 715653120 5461 341 2 170
//     0 715653120 5461 341 2 170
//     0 715653120 5461 341 2 170
//     0 2863136768 5461 341 2 170
//     printf("%u %u %u %u %u %u\n", rxfifo_contents, *rxfifo_shift, data_and_exp, data, exp, result);

    return result;
#endif // LINEAR_PCM_S16_INPUT
}
%}

The scaffolding is basically the same as usual. See the GitHub repository for details.
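The bit fiddling in ym3012_dac_get_sample is a bit hard to follow in C with all the #ifdefs, so here is a Python port of two pieces of it: splitting the 13-bit word into mantissa and exponent, and the JT51 branch that turns them into a 16-bit offset-binary sample. It's meant as a reading aid and a quick way to play with values, not as a reference implementation of the YM3012; the function names are made up. The non-JT51 branch is not ported here.

```python
def split_word(word, negate_exp=True):
    """Split a 13-bit word into (mantissa, exponent), like the C code does."""
    exp = (word >> 10) & 0b111        # top 3 bits
    if negate_exp:                    # NEGATE_EXP in the C code
        exp = ~exp & 0b111
    mantissa = word & 0b1111111111    # lower 10 bits
    return mantissa, exp

def jt51_to_offset_binary(mantissa, exp):
    """Port of the JT51 branch: shift, sign-extend, add 32768 (all mod 2**16)."""
    if exp == 0:
        return 0
    result = (mantissa << (exp - 1)) & 0xFFFF
    if mantissa & (1 << 9):           # top mantissa bit set: negative value
        result |= (0xFFFF << (exp + 9)) & 0xFFFF  # leading ones
    return (result + 32768) & 0xFFFF

# 5461 is the value from the debug output in the comments above.
mantissa, exp = split_word(5461)
print(mantissa, exp)
print(jt51_to_offset_binary(mantissa, exp))
```

The first print reproduces the "341 2" from the debug output in the C comments.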

Cloning an old (extremely difficult) puzzle game made by the company that probably invented Sokoban

Link to game for impatient readers: https://blog.qiqitori.com/tnt_bomb_bomb/
This is a JavaScript version of Thinking Rabbit’s TNT Bomb Bomb. Please enjoy!
The rest of this post is in English only. (´・ω・`)

I like Sokoban. A while ago, I saw someone play a Sokoban-like game called T.N.T. Bomb Bomb on a Sharp MZ-1500. I wanted it and almost immediately headed to the internets to find a disk image or ROM or whatever of it. And while I could find references and YouTube videos, I couldn’t find anything playable. (Note 1: me not being able to find the ROM doesn’t mean that it really doesn’t exist, of course. In fact, maybe this isn’t the first clone of these levels either. Note 2: it is likely that this copy of the game will be properly dumped in the near future.)

Fortunately, the game is partially implemented in BASIC. Which means you could just press Shift+Break and type LIST whenever you wanted! Then you could very easily modify variables and type RUN and play with extra lives or whatever. In my case, I just wanted a picture of every level, so I added a line (line 5) to specify the level to show, hit RUN, and took a picture. Here’s an example:

Hitting enter in this state will render level 4.

(As you can see, the graphics remain on screen after breaking, and sometimes the listing is difficult to see because of this. The graphics can be cleared by executing INIT “CRT:I” in BASIC, but that will cause rendering of the next level to fail.)

It looked like I got correct views of levels 1-10, and I have added these into my JavaScript clone of the game. Levels 11-20, on the other hand, instead of displaying the level number, displayed a game tile (a wire or part of the battery) inside the upper-right corner of the screen. I have therefore not added these levels to my implementation.

Level 1, original game

My very analog way of copying levels into my clone: 1) look at picture like the one above, 2) type out an array like this:

[
    [ 0, 1, 1, 1, 1, 1, 1, 1, 0, 0 ],
    [ 0, 1, 0, 0, 0, 0, 0, 1, 0, 0 ],
    [ 0, 1, 0, 0, 0, 0, 0, 1, 1, 0 ],
    [ 1, 1, 0, 10, 11, 0, 0, 0, 1, 0 ],
    [ 1, 0, 32, 12, 13, 20, 31, 0, 1, 0 ],
    [ 1, 0, 33, 40, 41, 42, 30, 53, 1, 0 ],
    [ 1, 0, 0, 0, 0, 0, 0, 0, 1, 0 ],
    [ 1, 1, 1, 1, 1, 1, 0, 0, 1, 0 ],
    [ 0, 0, 0, 0, 0, 1, 1, 1, 1, 0 ],
    [ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 ]
]

(In reality I added the commas after the fact, using a single find and replace operation. I think it took an hour or two for 10 levels.)
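For anyone who would rather skip the find-and-replace step, here is a minimal sketch of a helper that turns whitespace-separated tile codes directly into the nested array format. The function name parseLevel is made up for this example and is not part of the game's code.

```javascript
// Hypothetical helper: converts rows of whitespace-separated tile codes
// into a two-dimensional array of numbers (one inner array per row).
function parseLevel(text) {
    return text
        .split("\n")
        .filter(line => line.trim() !== "")          // skip blank lines
        .map(line => line.trim().split(/\s+/).map(Number));
}

// The first two rows of level 1, typed without commas.
const firstRows = parseLevel(`
    0 1 1 1 1 1 1 1 0 0
    0 1 0 0 0 0 0 1 0 0
`);
console.log(JSON.stringify(firstRows));
// [[0,1,1,1,1,1,1,1,0,0],[0,1,0,0,0,0,0,1,0,0]]
```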

The original game has a concept of “lives”, but I don’t think this is a valuable concept in a Sokoban-like game. So I didn’t port that over. In fact, I added functionality to make it easy to go right back to a point you were at before noticing the smell of brain fart. The game is fiendishly difficult in my opinion; I don’t think it’s necessary to make it any more difficult. In fact, if they hadn’t made it so damn difficult, maybe it would be up there with Sokoban and other famous puzzle games from the 1980s.

By the way, I’ve only solved level 1. It was super hard. Update 2023/03/01: And levels 2 and 3! It probably took longer to solve level 1 than to implement the basic game logic. No guarantees that levels 4 and beyond are solvable. If you solve level 4 or anything beyond it, please send me your sequence strings. I’ll verify them and add a note here or maybe in the game that the level has been shown to be solvable. :)

Making games like this is pretty straightforward, but there is one part in my implementation of this game that I think is slightly interesting. Since I do not know the solutions to the puzzles (and there may even be puzzles with multiple solutions), I wrote a small recursive function (trace_wire_path) that traces the electric path and returns true if it leads to the bomb (it starts tracing at a battery terminal). It’s not optimized at all, and I didn’t bother cleaning up the code after getting it to work for the first time (my gut feeling says that it should be easy to replace a lot of the if-thens with lookup tables), but this kind of stuff doesn’t occur too often in regular day-to-day programming, so I thought it was kind of fun. (Though it all depends on what you do for a living, I guess?) Let me know if there’s some corner case where it didn’t work for you. ;)

// for simplicity we always trace from the battery
function trace_wire_path(x, y, dir_x, dir_y) {
    // the tile we are about to enter, travelling in direction (dir_x, dir_y)
    var tile_to_check = levels[current_level][y][x];

    // is this tile compatible with the previous tile?
    if ((dir_x == 1) &&
        (ELEMENTS_COMPATIBLE_WITH_POS_DIRX.indexOf(tile_to_check) == -1))
        return false;
    else if ((dir_x == -1) &&
             (ELEMENTS_COMPATIBLE_WITH_NEG_DIRX.indexOf(tile_to_check) == -1))
        return false;
    else if ((dir_y == 1) &&
             (ELEMENTS_COMPATIBLE_WITH_POS_DIRY.indexOf(tile_to_check) == -1))
        return false;
    else if ((dir_y == -1) &&
             (ELEMENTS_COMPATIBLE_WITH_NEG_DIRY.indexOf(tile_to_check) == -1))
        return false;

    // are we done? (we already know we must be on the right side)
    if ((tile_to_check == TNT_BOTTOM_LEFT) ||
        (tile_to_check == TNT_BOTTOM_RIGHT))
        return true;

    // what's our new direction?
    var new_dir_x, new_dir_y; // keep these local instead of leaking implicit globals
    if ((tile_to_check == HORIZONTAL_WIRE) ||
        (tile_to_check == VERT_WIRE)) {
        new_dir_x = dir_x;
        new_dir_y = dir_y;
    } else if ((tile_to_check == CORNER_WIRE_NW) ||
                (tile_to_check == CORNER_WIRE_SW) ||
                (tile_to_check == CORNER_WIRE_NE) ||
                (tile_to_check == CORNER_WIRE_SE)) {
        if (dir_x) { // dir_x is 1 or -1
            new_dir_x = 0;
            if ((tile_to_check == CORNER_WIRE_NW) ||
                (tile_to_check == CORNER_WIRE_NE))
                new_dir_y = -1; // up
            else if ((tile_to_check == CORNER_WIRE_SW) ||
                     (tile_to_check == CORNER_WIRE_SE))
                new_dir_y = 1; // down
        } else { // dir_x is 0
            new_dir_y = 0;
            if ((tile_to_check == CORNER_WIRE_NW) ||
                (tile_to_check == CORNER_WIRE_SW))
                new_dir_x = -1;
            else if ((tile_to_check == CORNER_WIRE_NE) ||
                     (tile_to_check == CORNER_WIRE_SE))
                new_dir_x = 1;
        }
    }

    // recurse
    return trace_wire_path(x+new_dir_x, y+new_dir_y, new_dir_x, new_dir_y);
}
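The lookup-table idea mentioned above could look roughly like the following sketch. It is not the game's actual code: the tile codes are made up, the level is passed in as a parameter, the recursion is replaced by a loop, and the TNT is simplified to a single tile that counts as reached from any side.

```javascript
// Hypothetical tile codes for this sketch; the game's real numbering differs.
const EMPTY = 0, HORIZONTAL_WIRE = 1, VERT_WIRE = 2;
const CORNER_WIRE_NW = 3, CORNER_WIRE_NE = 4;
const CORNER_WIRE_SW = 5, CORNER_WIRE_SE = 6;
const TNT = 7;

// EXITS[tile]["dx,dy"] is the direction in which the path leaves `tile`
// after entering it while moving in direction (dx, dy); y grows downwards.
// A missing entry means the tile has no opening on the side we came from.
const EXITS = {
    [HORIZONTAL_WIRE]: { "1,0": [1, 0],   "-1,0": [-1, 0] },
    [VERT_WIRE]:       { "0,1": [0, 1],   "0,-1": [0, -1] },
    [CORNER_WIRE_NW]:  { "1,0": [0, -1],  "0,1":  [-1, 0] },
    [CORNER_WIRE_NE]:  { "-1,0": [0, -1], "0,1":  [1, 0] },
    [CORNER_WIRE_SW]:  { "1,0": [0, 1],   "0,-1": [-1, 0] },
    [CORNER_WIRE_SE]:  { "-1,0": [0, 1],  "0,-1": [1, 0] },
};

// Follows the wire starting at (x, y), moving in direction (dirX, dirY).
// Returns true if the path ends at the TNT, false if it breaks off.
function traceWirePath(level, x, y, dirX, dirY) {
    while (true) {
        const row = level[y];
        const tile = row === undefined ? undefined : row[x];
        if (tile === TNT) return true;
        const exits = EXITS[tile];                    // undefined for non-wire tiles
        const out = exits && exits[dirX + "," + dirY];
        if (!out) return false;                       // dead end or off the board
        [dirX, dirY] = out;
        x += dirX;
        y += dirY;
    }
}

const demoLevel = [
    [HORIZONTAL_WIRE, CORNER_WIRE_SW, EMPTY],
    [EMPTY,           VERT_WIRE,      EMPTY],
    [EMPTY,           CORNER_WIRE_NE, TNT],
];
console.log(traceWirePath(demoLevel, 0, 0, 1, 0)); // true
```

The compatibility check and the direction change collapse into one table lookup, which is where most of the if-thens go.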

Performance

It shouldn’t be a big deal to leave this game open in a tab somewhere. Virtually no CPU and not a lot of memory should be in use when nothing is happening. about:performance snapshot with the game just sitting there, waiting for user input:

No quantifiable energy impact; memory is low too.

In case anyone wants pictures of levels 11-20, which I haven’t included in my clone because they looked a bit suspicious:

(Pictures of levels 11 through 20, in order.)

Yeah, I don’t quite get why the battery isn’t inside the playfield, and there’s no TNT either… If anyone wants to convert these levels into my format, patches are welcome. :)

Copyrights

Copyright status of my clone: I recreated the original graphics in Inkscape. I do not claim any copyright on the graphics. As they are recreated somewhat faithfully, the graphics probably are technically pirated and not copyrightable. The code may or may not be copyrightable. If it is, let’s say it’s GPLv3 for now. However, I disclaim all copyright after release + 15 years.

Lowering the footprint of JT51 (YM2151 Verilog clone) to work on smaller FPGAs, specifically the ICE40UP5K (Part 1? WIP? Progress diary?) / UPduino mini-tutorial

Update 2023/03/09

I will upload the changes necessary to run the JT51 as a drop-in replacement of a real YM2151 relatively soon. Things aren’t 100% ironed out yet.

Update 2023/03/06

The below update states that there are errors in jt51_phrom.v and jt51_exprom.v, but these errors were minor and have been fixed. However, the fixed jt51_phrom.v doesn’t appear to have a large effect on the final number of LUT4s used. It looks like the mistake I had originally made (a race condition-type of mistake) was responsible for the majority of the savings. Boo.

Here’s a short sound recording with the mistake left in:

And here’s a short sound recording with the mistake ironed out:

In addition, the changes to jt51_sh.v mentioned in the below update might suffer from some problems too. So far I have only managed to run with jt51_sh8 enabled, so I have no way to compare the unmodified jt51_sh implementation to my modified implementation, but I also tried adding jt51_sh10 for another shift register, and that made things sound rather weird. It’s currently not clear to me why that is the case.

Important update 2023/03/01

I finally managed to test the modified code. Do not use it, there are probably errors in it. Using the modified sine tables (jt51_phrom.v) causes everything to sound noisy. Using the modified jt51_exprom.v messes something up, but the effect is rather subtle.

Instead, you can save on LUTs by modifying jt51_sh.v as follows. This is the original code:

module jt51_sh #(parameter width=5, stages=32, rstval=1'b0 ) (
    input                           rst,
    input                           clk,
    input                           cen,
    input       [width-1:0]         din,
    output      [width-1:0]         drop
);

reg [stages-1:0] bits[width-1:0];

genvar i;
generate
    for (i=0; i < width; i=i+1) begin: bit_shifter
        always @(posedge clk, posedge rst) begin
            if(rst)
                bits[i] <= {stages{rstval}};
            else if(cen)
                bits[i] <= {bits[i][stages-2:0], din[i]};
        end
        assign drop[i] = bits[i][stages-1];
    end
endgenerate

endmodule

It looks like the logic yosys synthesizes from this code is inefficient. I haven’t looked too much into it, but writing this code out (and removing one of the channels, etc.) causes yosys to synthesize more efficient code. As you can see, this code uses parameters that affect the way it is generated. I just picked one set of parameters that appeared multiple times, width=14 and stages=8, and that was enough to get the logic to just fit. I.e., I appended the following code inside the same file:

module jt51_sh8 #(parameter rstval=1'b0 ) (
    input                           rst,
    input                           clk,
    input                           cen,
    input       [13:0]         din,
    output      [13:0]         drop
);

reg [7:0] bits[13:0];

// jt51_sh #( .width(14), .stages(8)) prev1_buffer(
// reg [7:0] bits[13:0]

genvar i;
// generate
//     for (i=0; i < 14; i=i+1) begin: bit_shifter
//         always @(posedge clk, posedge rst) begin
//             if(rst)
//                 bits[i] <= {8{rstval}};
//             else if(cen)
//                 bits[i] <= {bits[i][6:0], din[i]};
//         end
//         assign drop[i] = bits[i][7];
//     end
// endgenerate
        always @(posedge clk, posedge rst) begin
            if(rst) begin
                bits[0] <= {8{rstval}};
                bits[1] <= {8{rstval}};
                bits[2] <= {8{rstval}};
                bits[3] <= {8{rstval}};
                bits[4] <= {8{rstval}};
                bits[5] <= {8{rstval}};
                bits[6] <= {8{rstval}};
                bits[7] <= {8{rstval}};
                bits[8] <= {8{rstval}};
                bits[9] <= {8{rstval}};
                bits[10] <= {8{rstval}};
                bits[11] <= {8{rstval}};
                bits[12] <= {8{rstval}};
                bits[13] <= {8{rstval}};
            end
            else if(cen) begin
                bits[0] <= {bits[0][6:0], din[0]};
                bits[1] <= {bits[1][6:0], din[1]};
                bits[2] <= {bits[2][6:0], din[2]};
                bits[3] <= {bits[3][6:0], din[3]};
                bits[4] <= {bits[4][6:0], din[4]};
                bits[5] <= {bits[5][6:0], din[5]};
                bits[6] <= {bits[6][6:0], din[6]};
                bits[7] <= {bits[7][6:0], din[7]};
                bits[8] <= {bits[8][6:0], din[8]};
                bits[9] <= {bits[9][6:0], din[9]};
                bits[10] <= {bits[10][6:0], din[10]};
                bits[11] <= {bits[11][6:0], din[11]};
                bits[12] <= {bits[12][6:0], din[12]};
                bits[13] <= {bits[13][6:0], din[13]};
            end
        end
        // each output bit taps the last stage of its own shift register
        assign drop[0] = bits[0][7];
        assign drop[1] = bits[1][7];
        assign drop[2] = bits[2][7];
        assign drop[3] = bits[3][7];
        assign drop[4] = bits[4][7];
        assign drop[5] = bits[5][7];
        assign drop[6] = bits[6][7];
        assign drop[7] = bits[7][7];
        assign drop[8] = bits[8][7];
        assign drop[9] = bits[9][7];
        assign drop[10] = bits[10][7];
        assign drop[11] = bits[11][7];
        assign drop[12] = bits[12][7];
        assign drop[13] = bits[13][7];
endmodule

And adjusted jt51_op.v to use jt51_sh8 instead of jt51_sh for prev1_buffer, prevprev1_buffer, and prev2_buffer.
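To make it easier to reason about what a rewritten shift register is supposed to do, here is a small behavioural model in JavaScript (not Verilog). It is only a sketch of my reading of jt51_sh: every input bit has its own shift register, and the value on drop is whatever was clocked in `stages` enabled clock edges earlier. The class and method names are made up for this example.

```javascript
// Hypothetical behavioural model of a `width`-bit wide, `stages`-deep
// shift register with clock enable, mirroring the structure of jt51_sh.
class ShiftRegisterModel {
    constructor(width, stages, rstval = 0) {
        // bits[i] is the history of input bit i, newest value first
        this.bits = Array.from({ length: width },
                               () => new Array(stages).fill(rstval));
    }

    // One rising clock edge; `din` is an array of `width` bits (0 or 1).
    clock(din, cen = true) {
        if (!cen) return;                 // clock enable low: hold state
        this.bits.forEach((history, i) => {
            history.unshift(din[i]);      // new bit enters at stage 0
            history.pop();                // oldest bit falls off the end
        });
    }

    // Output: the last stage of each bit's own shift register.
    drop() {
        return this.bits.map(history => history[history.length - 1]);
    }
}

const sh = new ShiftRegisterModel(2, 3);
sh.clock([1, 0]);
sh.clock([0, 1]);
sh.clock([0, 0]);
console.log(sh.drop()); // [ 1, 0 ]  (the first input, three clocks later)
sh.clock([0, 0]);
console.log(sh.drop()); // [ 0, 1 ]  (the second input)
```

Feeding the same input sequence to a simulation of the Verilog and to this model, and comparing the outputs, would be one way to catch a copy-and-paste slip in a hand-unrolled module.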

Original post follows:

Quick summary

I took JT51 (https://github.com/jotego/jt51) and shrunk it down a little. I got it down to just barely fit. Some of its lookup tables are post-processed by logic that costs a couple hundred LUT4s; I made the lookup tables contain the already processed values instead. We’re now using slightly more RAM.

How we got here

I am currently debugging a YM2151-based device, the Yamaha SFG-01 sound module for MSX PCs. There is… wonky audio when two notes are played at once on the attached keyboard. I started off by emulating the YM3012 DAC on a Raspberry Pi Pico. More on that in a future post. More on the whole repair in a future post, in fact. My plan was to run the original YM2151 and the FPGA version side-by-side (with the exact same inputs) and to compare the audio outputs. However, after I already did most things detailed in this post, I realized that plan probably wasn’t going to work, as (if I read the datasheet correctly) the YM2151 generates interrupts which probably have to be acknowledged, and the data bus is bidirectional, and actually does get read out by the CPU occasionally. So the original chip and the FPGA would have to work in 100% perfect sync, and who knows how achievable that is.

I have two FPGA boards, and they’re both exactly the same, UPduino v3.0. I bought these back in 2020 or so, expecting I’d maybe come up with a project at some point. They were cheaper back then! I paid 43.20 USD + 6 USD shipping for 2! So per device, in JPY at that time: 21.6 * 103 = 2225 JPY. Currently, the price is $30 per device, and USD/JPY is 133.8. 30 * 133.8 = 4014 JPY, so almost double. Yikes.

Only have an ICE40UP3K? Allegedly, if you use the open-source toolchain, it’ll have exactly the same amount of LUT4s available as an ICE40UP5K. Apparently it’s just the official IDE enforcing an artificial limit?

So all I’d done up to this point was: I installed the open-source toolchain, changed the speed of the LED blinking example, re-flashed, and got some satisfaction that it all worked. Let’s start from that point. I think the official tutorials should get you there (except for the speed change maybe).

Also: important: I haven’t tested my revised Verilog yet. That’s something for part 2 (not done/written yet).

Going beyond the rgb_blink example

This is the first time I’m compiling feral Verilog code for this board, so I took notes along the way. This blog post is just what I’d written down, just polished a little. First of all, make sure you can compile and flash the rgb_blink example. Follow the documentation, at the very least https://upduino.readthedocs.io/en/latest/getting_started/tool_installation.html and https://upduino.readthedocs.io/en/latest/tutorials/blink_led.html.

Then, git clone https://github.com/jotego/jt51. Copy UPduino-v3.0/RTL/common from the toolchain to jt51/ and UPduino-v3.0/RTL/blink_led/Makefile to jt51/hdl/. Perhaps cd to jt51/hdl and modify the Makefile as follows.

Note: Makefiles consist of rules laying out how to build a certain file. Rule blocks start like this: “filename: dependencies”. The dependencies are filenames. There is only one rule in our Makefile that directly depends on .v files:

rgb_blink.json: rgb_blink.v

Instead of rgb_blink.v, we’ll replace that by all the jt51_….v files we have in jt51/hdl:

jt51_acc.v jt51_csr_ch.v jt51_csr_op.v jt51_eg.v jt51_exp2lin.v jt51_exprom.v jt51_kon.v jt51_lfo.v jt51_lin2exp.v jt51_mmr.v jt51_mod.v jt51_noise_lfsr.v jt51_noise.v jt51_op.v jt51_pg.v jt51_phinc_rom.v jt51_phrom.v jt51_pm.v jt51_reg.v jt51_sh.v jt51_timers.v jt51.v

Then also change the yosys command to synthesize from these .v files instead of rgb_blink.v:

yosys -q -p "synth_ice40 -json rgb_blink.json" jt51_acc.v jt51_csr_ch.v jt51_csr_op.v jt51_eg.v jt51_exp2lin.v jt51_exprom.v jt51_kon.v jt51_lfo.v jt51_lin2exp.v jt51_mmr.v jt51_mod.v jt51_noise_lfsr.v jt51_noise.v jt51_op.v jt51_pg.v jt51_phinc_rom.v jt51_phrom.v jt51_pm.v jt51_reg.v jt51_sh.v jt51_timers.v jt51.v

And finally, let’s change all names from “rgb_blink” to “jt51” using search and replace: “rgb_blink” -> “jt51”. You should end up with a Makefile like this:

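# Makefile to build UPduino v3.0 jt51.v  with icestorm toolchain
# Original Makefile is taken from: 
# https://github.com/tomverbeure/upduino/tree/master/blink
# On Linux, copy the included upduinov3.rules to /etc/udev/rules.d/ so that we don't have
# to use sudo to flash the bit file.
# Thanks to thanhtranhd for making changes to thsi makefile.

jt51.bin: jt51.asc
	icepack jt51.asc jt51.bin

jt51.asc: jt51.json ../common/upduino.pcf
	nextpnr-ice40 --up5k --package sg48 --json jt51.json --pcf ../common/upduino.pcf --asc jt51.asc   # run place and route

jt51.json: jt51_acc.v jt51_csr_ch.v jt51_csr_op.v jt51_eg.v jt51_exp2lin.v jt51_exprom.v jt51_kon.v jt51_lfo.v jt51_lin2exp.v jt51_mmr.v jt51_mod.v jt51_noise_lfsr.v jt51_noise.v jt51_op.v jt51_pg.v jt51_phinc_rom.v jt51_phrom.v jt51_pm.v jt51_reg.v jt51_sh.v jt51_timers.v jt51.v
	yosys -q -p "synth_ice40 -json jt51.json" jt51_acc.v jt51_csr_ch.v jt51_csr_op.v jt51_eg.v jt51_exp2lin.v jt51_exprom.v jt51_kon.v jt51_lfo.v jt51_lin2exp.v jt51_mmr.v jt51_mod.v jt51_noise_lfsr.v jt51_noise.v jt51_op.v jt51_pg.v jt51_phinc_rom.v jt51_phrom.v jt51_pm.v jt51_reg.v jt51_sh.v jt51_timers.v jt51.v

.PHONY: flash
flash:
	iceprog -d i:0x0403:0x6014 jt51.bin

.PHONY: clean
clean:
	$(RM) -f jt51.json jt51.asc jt51.bin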

Make sure you have tab characters, not space characters in the rule block indentation. (Trap for young players.) Make sure you also copied the common/ directory as instructed above. Then, execute “make”. If you get the following error:

$ make
nextpnr-ice40 --up5k --package sg48 --json jt51.json --pcf ../common/upduino.pcf --asc jt51.asc   # run place and route
/bin/sh: 1: nextpnr-ice40: not found
make: *** [Makefile:12: jt51.asc] Error 127

That means you need nextpnr-ice40 in your PATH. Figure out the path, and then execute:

PATH=$PATH:/path/to/directory/containing/nextpnr-ice40

Next, you should get the following error:

$ make
nextpnr-ice40 --up5k --package sg48 --json jt51.json --pcf ../common/upduino.pcf --asc jt51.asc   # run place and route
ERROR: IO 'xright[15]' is unconstrained in PCF (override this error with --pcf-allow-unconstrained)
ERROR: Loading PCF failed.
0 warnings, 2 errors
make: *** [Makefile:12: jt51.asc] Error 255

For now, override this error as instructed, by changing the nextpnr-ice40 command in the Makefile as follows:

nextpnr-ice40 --up5k --package sg48 --json jt51.json --pcf ../common/upduino.pcf --asc jt51.asc --pcf-allow-unconstrained

At this point we’ll finally get some actually interesting error output.

As-is, the project doesn’t fit on the ICE40

...
Info: Device utilisation:
Info:            ICESTORM_LC:  6680/ 5280   126%
Info:           ICESTORM_RAM:     6/   30    20%
Info:                  SB_IO:    91/   96    94%
Info:                  SB_GB:     8/    8   100%
Info:           ICESTORM_PLL:     0/    1     0%
Info:            SB_WARMBOOT:     0/    1     0%
Info:           ICESTORM_DSP:     0/    8     0%
Info:         ICESTORM_HFOSC:     0/    1     0%
Info:         ICESTORM_LFOSC:     0/    1     0%
Info:                 SB_I2C:     0/    2     0%
Info:                 SB_SPI:     0/    2     0%
Info:                 IO_I3C:     0/    2     0%
Info:            SB_LEDDA_IP:     0/    1     0%
Info:            SB_RGBA_DRV:     0/    1     0%
Info:         ICESTORM_SPRAM:     0/    4     0%

Info: Placed 0 cells based on constraints.
ERROR: Unable to place cell '$abc$113462$auto$blifparse.cc:492:parse_blif$114175_LC', no BELs remaining to implement cell type 'ICESTORM_LC'
91 warnings, 1 error
make: *** [Makefile:13: jt51.asc] Error 255

Okay, first things first. How old is our toolchain?

$ yosys -V
Yosys 0.8 (git sha1 5706e90)

Let’s see, the newest version of yosys, at the time of this writing, is… 0.26. Wait what? Ah, it looks like a smaller number, but is probably intended to be a larger number. It appears that my version is from 2018. Likely, I’d just installed it from Debian’s repositories. Let’s try building yosys from Git so we can upgrade from 0.8 to 0.26. It would like to build using clang by default, but you can build using gcc too. You also need tcl8.6-dev (or probably other versions work fine too).

$ git clone https://github.com/YosysHQ/yosys
$ cd yosys
$ make
/bin/sh: 1: clang: not found
[  0%] Building kernel/version_4c334b905.cc
[  0%] Building kernel/version_4c334b905.o
/bin/sh: 1: clang: not found
make: *** [Makefile:754: kernel/version_4c334b905.o] Error 12
$ make config-gcc
...
In file included from kernel/calc.cc:24:
./kernel/yosys.h:81:12: fatal error: tcl.h: No such file or directory
 #  include <tcl.h>
...
$ sudo apt-get install tcl8.6-dev
...
$ make config-gcc
...
$ # success
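About 0.8 versus 0.26: version strings like these are usually compared component by component rather than as decimal fractions, which is why 0.26 is the newer one. A minimal sketch of that comparison follows; the function name is made up, and it assumes purely numeric, dot-separated components.

```javascript
// Hypothetical helper: compares dot-separated numeric version strings.
// Returns -1 if a is older than b, 1 if it is newer, 0 if they are equal.
function compareVersions(a, b) {
    const pa = a.split(".").map(Number);
    const pb = b.split(".").map(Number);
    for (let i = 0; i < Math.max(pa.length, pb.length); i++) {
        const diff = (pa[i] || 0) - (pb[i] || 0);   // missing parts count as 0
        if (diff !== 0) return Math.sign(diff);
    }
    return 0;
}

console.log(compareVersions("0.8", "0.26"));       // -1: 0.8 is older
console.log(parseFloat("0.8") > parseFloat("0.26")); // true: why it looks backwards
```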

And if we try synthesizing again now, we do get a significant improvement. (Also synthesis time is faster I think.) But we are not quite there yet:

Info: Device utilisation:
Info:            ICESTORM_LC:  5836/ 5280   110%
Info:           ICESTORM_RAM:     3/   30    10%
Info:                  SB_IO:    91/   96    94%
Info:                  SB_GB:     8/    8   100%
Info:           ICESTORM_PLL:     0/    1     0%
Info:            SB_WARMBOOT:     0/    1     0%
Info:           ICESTORM_DSP:     0/    8     0%
Info:         ICESTORM_HFOSC:     0/    1     0%
Info:         ICESTORM_LFOSC:     0/    1     0%
Info:                 SB_I2C:     0/    2     0%
Info:                 SB_SPI:     0/    2     0%
Info:                 IO_I3C:     0/    2     0%
Info:            SB_LEDDA_IP:     0/    1     0%
Info:            SB_RGBA_DRV:     0/    1     0%
Info:         ICESTORM_SPRAM:     0/    4     0%

Shrinking the footprint by changing yosys options (using DSP cells)

110% isn’t too far from where we need to be, so let’s investigate if we can do anything to reduce our FPGA footprint. First of all, there are three files that include the word ‘rom’, which may have a significant effect on our footprint. But it looks like our toolchain is clever — it actually uses ICESTORM_RAM to implement the ROM. (Replacing the entire case/endcase block in the rather large jt51_phinc_rom.v file with a single statement reduced the LC count by 2-3%, and ICESTORM_RAM from 10% to 0%.)

Next, we forget about yosys for a second, and attempt to synthesize this using the official toolchain from Lattice, IceCube2. You’ll need an account and follow a link to generate a license file. You need to enter a MAC address to bind the license to a certain computer. (Or maybe a computer with a certain network adapter.)

IceCube2’s synthesis finishes in a few seconds, and only uses 11 logic cells. Hmm, so efficient! Or more likely, something’s weird. And yes, indeed it’s getting confused and thinks that jt51_noise_lfsr.v is the main file. Apparently, this file’s modules aren’t actually used anywhere. So we get rid of that file (and also get rid of it in our Makefile above) and re-synthesize. Synthesis finishes successfully, and apparently uses 1698 LUTs. Hmm, really? (No, but let me go off on a quick tangent first.)

Okay, let’s assume for a second that yosys is much, much worse than IceCube2. It’s time to google for something like ‘yosys vs icecube2’. A person on the EEVblog forums says, “The IceCube2 generates smaller and faster design (most visible with larger designs) than the IceStorm does, it can infer ie. multipliers with built-in DSP modules (UP5k) etc. The IceStorm is less effective, and infers ie. multipliers in fabric (you have to instantiate the modules/primitives manually).” Hmm, interesting. Well, it turns out you can enable the DSP modules in yosys using the -dsp option, so we modify the Makefile as follows:

yosys -q -p "synth_ice40 -dsp -json jt51.json" jt51_acc.v jt51_csr_ch.v jt51_csr_op.v jt51_eg.v jt51_exp2lin.v jt51_exprom.v jt51_kon.v jt51_lfo.v jt51_lin2exp.v jt51_mmr.v jt51_mod.v jt51_noise.v jt51_op.v jt51_pg.v jt51_phinc_rom.v jt51_phrom.v jt51_pm.v jt51_reg.v jt51_sh.v jt51_timers.v jt51.v

That reduces our LUT count by ~2%. Every percent counts, but we’re not quite there yet. Looking at https://github.com/YosysHQ/yosys/blob/master/techlibs/ice40/synth_ice40.cc, we see a few more options we could try, e.g., -spram, -noabc, -abc2, -abc9 (experimental), -flowmap (experimental).

-noabc brings us back up to 120%. -flowmap also increases the number of logic cells to a similar number. -abc2 eliminates 19 logic cells vs. just -abc, but that’s not a lot, and our percentage doesn’t change. -abc9 doesn’t yield much of an improvement either. Hmm, looks like we’ve exhausted some of the lower hanging fruit. Anyway, let’s take another closer look at the official toolchain’s output. When your eyes get a little more used to its output you actually notice that it says:

Cell usage:
GND             39 uses
SB_CARRY        366 uses
SB_DFF          22 uses
SB_DFFE         276 uses
SB_DFFER        2709 uses
SB_DFFES        747 uses
SB_DFFESR       8 uses
SB_DFFESS       10 uses
SB_DFFR         29 uses
SB_DFFS         1 use
SB_DFFSR        23 uses
SB_GB           3 uses
SB_RAM1024x4    3 uses
VCC             39 uses
SB_MAC16        2 uses
    MULTONLY    1 use
    MULTADD     1 use
SB_LUT4         1698 uses

Hey. 1698 LUTs, but 3825 DFFs, and the P&R Flow tool confirms this:

Number of LUTs      :   1698
Number of DFFs      :   3825
Number of Carrys    :   366

These DFFs also take up logic cells (each iCE40 logic cell pairs one LUT4 with one flip-flop), so if no LUT shares a cell with a DFF, the total is 1698 + 3825 = 5523 logic cells, which is actually extremely close to yosys, and also too much for the 5280 we have. (Note that I already edited the Verilog a little bit at this point, so the number on an unmodified repository would be a little higher.)

Let’s remove the -q option from the yosys synth_ice40 command in the Makefile, and take a look at the output close to the summary that we looked at before. Scrolling way past a lot of verbose output, we get a summary like the following, and can see that yosys is indeed very close.

Info: Packing constants..
Info: Packing IOs..
Info: Packing LUT-FFs..
Info:     1462 LCs used as LUT4 only
Info:      515 LCs used as LUT4 and DFF
Info: Packing non-LUT FFs..
Info:     3367 LCs used as DFF only
Info: Packing carries..
Info:      184 LCs used as CARRY only
Info: Packing indirect carry+LUT pairs...
Info:       63 LUTs merged into carry LCs

Shrinking the footprint by removing features

Next, we could try and cut down on features in order to reduce the required number of logic cells. First of all, I nuked the entire right channel (“right” and “xright”) by commenting out a couple lines in jt51.v and jt51_acc.v. That shaved off about 2%. I kept “xleft” but also got rid of the converted “left”. That means we no longer need to compile jt51_exp2lin.v, which seems to save 9 LUTs.

Shrinking the footprint by trading LUTs for RAM

A cursory (liar liar pants on fire) glance over the code revealed an opportunity to potentially save a more significant number of LUTs. In jt51_op.v, we refer to a sine table (which is in jt51_phrom.v) and concatenate certain bits from this table. In the following snippet, the sine table is already in the sta_XI register:

case( phaselo_XI[7:6] )
    2'b00: stb = { 10'b0, sta_XI[29], sta_XI[25], 2'b0, sta_XI[18], 
        sta_XI[14], 1'b0, sta_XI[7] , sta_XI[3] };
    2'b01: stb = { 6'b0 , sta_XI[37], sta_XI[34], 2'b0, sta_XI[28], 
        sta_XI[24], 2'b0, sta_XI[17], sta_XI[13], sta_XI[10], sta_XI[6], sta_XI[2] };
    2'b10: stb = { 2'b0, sta_XI[43], sta_XI[41], 2'b0, sta_XI[36],
        sta_XI[33], 2'b0, sta_XI[27], sta_XI[23], 1'b0, sta_XI[20],
        sta_XI[16], sta_XI[12], sta_XI[9], sta_XI[5], sta_XI[1] };
    2'b11: stb = {
            sta_XI[45], sta_XI[44], sta_XI[42], sta_XI[40]
        , sta_XI[39], sta_XI[38], sta_XI[35], sta_XI[32]
        , sta_XI[31], sta_XI[30], sta_XI[26], sta_XI[22]
        , sta_XI[21], sta_XI[19], sta_XI[15], sta_XI[11]
        , sta_XI[8], sta_XI[4], sta_XI[0] };
    default: stb = 19'dx;

If you are new to Verilog, numbers often look like this: <total bit width>'<letter indicating number format, e.g., b for binary><number>. The array indices refer to bit numbers. E.g., sta_XI[38] is bit 38 in sta_XI, counting from 0. “case” is like a switch statement in C. So up here, we do something like:

switch(bits 7 and 6 of phaselo_XI) {
    case 0: ...;
    case 1: ...;
    case 2: ...;
    case 3: ...;
    default: ...;
}

(The “default” clause is extraneous, but doesn’t cause harm.)

The sine table is fairly large, at 32 entries of 46 bits. In the above code snippet, we pick (to me, super random) bits from the table and also insert constant 0s here and there. E.g., the first line reads in plain words: ten 0s, followed by sinetable[i][29], followed by sinetable[i][25], followed by two 0s, etc. The sine table isn’t used anywhere else.
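
If the concatenation syntax is hard to picture, here is the first case (2'b00) spelled out as a JavaScript sketch. The helper names are made up, and BigInt is used because a 46-bit entry doesn't fit JavaScript's 32-bit bitwise operators. The test value at the end is arbitrary, not a real table entry.

```javascript
// Returns bit n of a BigInt, counting from 0 at the least significant end,
// like sta_XI[n] in Verilog.
function bit(value, n) {
    return (value >> BigInt(n)) & 1n;
}

// Concatenates [value, width] pieces, most significant piece first,
// like Verilog's { a, b, c }.
function concat(pieces) {
    let result = 0n;
    for (const [value, width] of pieces) {
        const mask = (1n << BigInt(width)) - 1n;
        result = (result << BigInt(width)) | (BigInt(value) & mask);
    }
    return result;
}

// { 10'b0, sta[29], sta[25], 2'b0, sta[18], sta[14], 1'b0, sta[7], sta[3] }
// The widths add up to 19 bits.
function stbCase00(sta) {
    return concat([
        [0n, 10], [bit(sta, 29), 1], [bit(sta, 25), 1], [0n, 2],
        [bit(sta, 18), 1], [bit(sta, 14), 1], [0n, 1],
        [bit(sta, 7), 1], [bit(sta, 3), 1],
    ]);
}

// Arbitrary test value with only bits 29 and 3 set.
const sta = (1n << 29n) | (1n << 3n);
console.log(stbCase00(sta).toString(2).padStart(19, "0"));
// 0000000000100000001
```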

Our opportunity is: instead of generating a circuit to combine bits from the sinetable together, we can just rewrite the sine lookup table to already contain what we call stb above. It doesn’t matter if our table ends up a little larger (it could be up to four times larger), because as mentioned above, RAM is used to store these tables. But our table isn’t that much larger, really. Before we had 32×46=1472 bits, now we have a three-dimensional array of dimensions 4x32x19=2432 bits, not even twice as large.

This optimization takes us to 5363/5280 (101%), which means we’re almost done! (If we use four two-dimensional arrays and a case block, the savings are much less pronounced, 104%.) Of course, there is no free lunch: we now use more RAM: ICESTORM_RAM 5/30 (16%). Before it was 3/30 (10%). But we still have a lot of RAM left.

Rewriting the table by hand presumably gets old quickly, so I wrote a short Perl script to do it. (Luckily, it can sometimes be very easy to transform Verilog source code to Perl using find and replace with regular expressions.)

#!/usr/bin/perl

$sta_XI = [[0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0],
    [0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1],
    [0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0],
    [0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 1, 1, 0, 0, 1, 0],
    [0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 1, 1, 0, 1, 0, 0, 1],
    [0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 1, 0],
    [0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0],
    [0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0],
    [0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 0, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1],
    [0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 1, 0],
    [0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0],
    [0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1],
    [0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1],
    [0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0],
    [0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 1, 0],
    [0, 1, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1],
    [0, 1, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1],
    [0, 1, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 0, 0, 1],
    [0, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1],
    [0, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1],
    [0, 1, 0, 0, 1, 1, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 1, 0, 1, 1, 1, 0, 0, 0, 1],
    [0, 1, 0, 0, 1, 1, 0, 0, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1],
    [0, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1],
    [0, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 1, 0, 1, 1],
    [0, 1, 1, 1, 0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1],
    [0, 1, 1, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0],
    [0, 1, 1, 1, 0, 1, 0, 0, 1, 0, 1, 1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1, 1],
    [1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0],
    [1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1],
    [1, 0, 1, 0, 0, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0],
    [1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1],
    [1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 1]];

for $i (0..31) {
    $stb->[0][$i] = [ (0)x10, $sta_XI->[$i][29], $sta_XI->[$i][25], (0)x2, $sta_XI->[$i][18], 
            $sta_XI->[$i][14], 0, $sta_XI->[$i][7] , $sta_XI->[$i][3] ];
    $stb->[1][$i] = [ (0)x6, $sta_XI->[$i][37], $sta_XI->[$i][34], (0)x2, $sta_XI->[$i][28], 
            $sta_XI->[$i][24], (0)x2, $sta_XI->[$i][17], $sta_XI->[$i][13], $sta_XI->[$i][10], $sta_XI->[$i][6], $sta_XI->[$i][2] ];
    $stb->[2][$i] = [ (0)x2, $sta_XI->[$i][43], $sta_XI->[$i][41], (0)x2, $sta_XI->[$i][36],
            $sta_XI->[$i][33], (0)x2, $sta_XI->[$i][27], $sta_XI->[$i][23], 0, $sta_XI->[$i][20],
            $sta_XI->[$i][16], $sta_XI->[$i][12], $sta_XI->[$i][9], $sta_XI->[$i][5], $sta_XI->[$i][1] ];
    $stb->[3][$i] = [$sta_XI->[$i][45], $sta_XI->[$i][44], $sta_XI->[$i][42], $sta_XI->[$i][40],
            $sta_XI->[$i][39], $sta_XI->[$i][38], $sta_XI->[$i][35], $sta_XI->[$i][32],
            $sta_XI->[$i][31], $sta_XI->[$i][30], $sta_XI->[$i][26], $sta_XI->[$i][22],
            $sta_XI->[$i][21], $sta_XI->[$i][19], $sta_XI->[$i][15], $sta_XI->[$i][11],
            $sta_XI->[$i][8], $sta_XI->[$i][4], $sta_XI->[$i][0] ];
}

for $j (0..3) {
    for $i (0..31) {
        print "stb[$j]\[5'd$i] = 19'b";
        for $k (0..18) {
            print $stb->[$j][$i][$k];
        }
        print ";\n"
    }
}

We could actually go even further; looking a little further ahead, stb is only used to fill in stf and stg:

    stf = { stb[18:15], stb[12:11], stb[8:7], stb[4:3], stb[0] };
    // Gated value to sum; bit 14 is indeed used twice
    if( phaselo_XI[0] )
        stg = { 2'b0, stb[14], stb[14:13], stb[10:9], stb[6:5], stb[2:1] };
    else
        stg = 11'd0;

Which means we could change our lookup table once more and directly read out stf and stg. However, scrolling down a little further in the same file, we see the same kind of pattern in the code doing the post-processing for jt51_exprom, so let’s tackle that one instead. Changing jt51_exprom to directly return etf and etg gets us: 5196/ 5280 (98%). Yay!
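To make the stf/stg idea a bit more concrete, here is a minimal Python sketch of the bit shuffling shown above. It is not part of JT51 or of the generator script; the helper names (part_select, concat, stf_from_stb, stg_from_stb) are made up for illustration, and gate stands in for phaselo_XI[0]. Running something like this over every table entry would give the values for a table that returns stf and stg directly.

```python
def part_select(value, hi, lo):
    """Return value[hi:lo] as an integer, like a Verilog part-select."""
    width = hi - lo + 1
    return (value >> lo) & ((1 << width) - 1)


def concat(*fields):
    """Verilog-style {a, b, c}: fields are (value, width) pairs, MSB first."""
    result = 0
    for value, width in fields:
        result = (result << width) | (value & ((1 << width) - 1))
    return result


def stf_from_stb(stb):
    """stf = { stb[18:15], stb[12:11], stb[8:7], stb[4:3], stb[0] } (11 bits)."""
    return concat(
        (part_select(stb, 18, 15), 4),
        (part_select(stb, 12, 11), 2),
        (part_select(stb, 8, 7), 2),
        (part_select(stb, 4, 3), 2),
        (part_select(stb, 0, 0), 1),
    )


def stg_from_stb(stb, gate):
    """stg as in the snippet above; gate plays the role of phaselo_XI[0]."""
    if not gate:
        return 0
    return concat(
        (0, 2),
        (part_select(stb, 14, 14), 1),  # bit 14 is used twice
        (part_select(stb, 14, 13), 2),
        (part_select(stb, 10, 9), 2),
        (part_select(stb, 6, 5), 2),
        (part_select(stb, 2, 1), 2),
    )
```

Since the gate is just one more input bit, it could either stay outside the table or become part of the table address; the sketch leaves that choice open.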

Now, if we wanted to make a drop-in replacement for an actual YM2151 chip, we’d have to serialize sound output. JT51 outputs xleft/xright/left/right using 16 IO pins each. (We don’t even have enough IO pins on our FPGA.) But the actual YM2151 uses four pins: clock, SH1, SH2, and SO. SO is the serialized representation of left/right, synced with clock. SH1 is high if SO is currently outputting left, SH2 is high if SO is currently outputting right. In order to implement that, we need a few more LUTs.
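As a rough model of that serialization, here is a small Python sketch that follows the description above and nothing more. The word length (16 bits), the bit order (LSB first) and the function name serialize_frame are all assumptions for illustration; the real chip's sample format and timing would have to come from the datasheet.

```python
def serialize_frame(left, right, width=16):
    """Yield one (so, sh1, sh2) tuple per clock tick for one stereo sample.

    Assumptions: left goes out first, then right, each LSB first,
    with SH1 high during left and SH2 high during right.
    """
    for sample, sh1, sh2 in ((left, 1, 0), (right, 0, 1)):
        for bit in range(width):
            yield ((sample >> bit) & 1, sh1, sh2)
```

For example, list(serialize_frame(0x0001, 0x8000)) gives 32 ticks: the first 16 carry the left sample with SH1 set, the last 16 the right sample with SH2 set.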

Anyway, that was a rather long-winded explanation. Below is the code. I also have it on https://github.com/qiqitori/jt51. Note that the code hasn’t been tested yet at the time of this writing.

Revised jt51_phrom.v (still GPLv3 or later but the copyright header is a little too big for this space):

module jt51_phrom
(
	input [4:0] addr,
	input clk,
	input cen,
	input [1:0] phaselo_XI_76,
	output reg [18:0] ph
);
	reg [4:0] addr_latched; // address sampled on the rising clock edge
	reg [18:0] stb[3:0][31:0];
	initial
	begin
		stb[0][5'd0] = 19'b0000000000000000001;
		stb[0][5'd1] = 19'b0000000000100000001;
		stb[0][5'd2] = 19'b0000000000000000001;
		stb[0][5'd3] = 19'b0000000000100000001;
		stb[0][5'd4] = 19'b0000000000100010001;
		stb[0][5'd5] = 19'b0000000000000010001;
		stb[0][5'd6] = 19'b0000000000100010001;
		stb[0][5'd7] = 19'b0000000000100000001;
		stb[0][5'd8] = 19'b0000000000000000001;
		stb[0][5'd9] = 19'b0000000000100000001;
		stb[0][5'd10] = 19'b0000000000100010001;
		stb[0][5'd11] = 19'b0000000000000010001;
		stb[0][5'd12] = 19'b0000000000000000000;
		stb[0][5'd13] = 19'b0000000000100000000;
		stb[0][5'd14] = 19'b0000000000110000000;
		stb[0][5'd15] = 19'b0000000000100010000;
		stb[0][5'd16] = 19'b0000000000000010000;
		stb[0][5'd17] = 19'b0000000000000000000;
		stb[0][5'd18] = 19'b0000000000000000000;
		stb[0][5'd19] = 19'b0000000000010010000;
		stb[0][5'd20] = 19'b0000000000000010000;
		stb[0][5'd21] = 19'b0000000000110010000;
		stb[0][5'd22] = 19'b0000000000110000001;
		stb[0][5'd23] = 19'b0000000000110000001;
		stb[0][5'd24] = 19'b0000000000010010001;
		stb[0][5'd25] = 19'b0000000000010000001;
		stb[0][5'd26] = 19'b0000000000010001001;
		stb[0][5'd27] = 19'b0000000000010011000;
		stb[0][5'd28] = 19'b0000000000110011000;
		stb[0][5'd29] = 19'b0000000000110000010;
		stb[0][5'd30] = 19'b0000000000010011011;
		stb[0][5'd31] = 19'b0000000000010010000;

		stb[1][5'd0] = 19'b0000000100100011100;
		stb[1][5'd1] = 19'b0000001000000001100;
		stb[1][5'd2] = 19'b0000001000000001100;
		stb[1][5'd3] = 19'b0000001000100000000;
		stb[1][5'd4] = 19'b0000001000100000000;
		stb[1][5'd5] = 19'b0000001100000001000;
		stb[1][5'd6] = 19'b0000001000000001000;
		stb[1][5'd7] = 19'b0000001100100001000;
		stb[1][5'd8] = 19'b0000001000100000100;
		stb[1][5'd9] = 19'b0000000000010010100;
		stb[1][5'd10] = 19'b0000000000110011100;
		stb[1][5'd11] = 19'b0000000000110011100;
		stb[1][5'd12] = 19'b0000000100010010000;
		stb[1][5'd13] = 19'b0000001100110011000;
		stb[1][5'd14] = 19'b0000001100110011000;
		stb[1][5'd15] = 19'b0000001100010000100;
		stb[1][5'd16] = 19'b0000000000110001100;
		stb[1][5'd17] = 19'b0000000000010001100;
		stb[1][5'd18] = 19'b0000000100010000000;
		stb[1][5'd19] = 19'b0000001100110001000;
		stb[1][5'd20] = 19'b0000001000010010100;
		stb[1][5'd21] = 19'b0000001000100011100;
		stb[1][5'd22] = 19'b0000001000000010001;
		stb[1][5'd23] = 19'b0000000000100011001;
		stb[1][5'd24] = 19'b0000000000010001101;
		stb[1][5'd25] = 19'b0000000000110000001;
		stb[1][5'd26] = 19'b0000001100000000101;
		stb[1][5'd27] = 19'b0000000100110000001;
		stb[1][5'd28] = 19'b0000000000000011101;
		stb[1][5'd29] = 19'b0000001000110011101;
		stb[1][5'd30] = 19'b0000000100010010001;
		stb[1][5'd31] = 19'b0000001100100010111;

		stb[2][5'd0] = 19'b0001001000000000000;
		stb[2][5'd1] = 19'b0000001100000000000;
		stb[2][5'd2] = 19'b0000001100010000000;
		stb[2][5'd3] = 19'b0001001100010000010;
		stb[2][5'd4] = 19'b0000001000000000010;
		stb[2][5'd5] = 19'b0001001000000000010;
		stb[2][5'd6] = 19'b0001001100000000010;
		stb[2][5'd7] = 19'b0011001000010001010;
		stb[2][5'd8] = 19'b0011001000010001010;
		stb[2][5'd9] = 19'b0010001100010001010;
		stb[2][5'd10] = 19'b0001001000010001010;
		stb[2][5'd11] = 19'b0011001100110001010;
		stb[2][5'd12] = 19'b0011001000110000101;
		stb[2][5'd13] = 19'b0000001100100000101;
		stb[2][5'd14] = 19'b0010000000100000101;
		stb[2][5'd15] = 19'b0001001000110000101;
		stb[2][5'd16] = 19'b0010001100100000101;
		stb[2][5'd17] = 19'b0001001100010101101;
		stb[2][5'd18] = 19'b0011001000000101111;
		stb[2][5'd19] = 19'b0000000000010101111;
		stb[2][5'd20] = 19'b0001001000110101111;
		stb[2][5'd21] = 19'b0010000100110101111;
		stb[2][5'd22] = 19'b0011000100100100001;
		stb[2][5'd23] = 19'b0001000100110100001;
		stb[2][5'd24] = 19'b0001000000000010001;
		stb[2][5'd25] = 19'b0001000000010011011;
		stb[2][5'd26] = 19'b0000000000100011011;
		stb[2][5'd27] = 19'b0001000100110011000;
		stb[2][5'd28] = 19'b0001000000100011000;
		stb[2][5'd29] = 19'b0000000100000110110;
		stb[2][5'd30] = 19'b0000000000010110110;
		stb[2][5'd31] = 19'b0010000100100110111;

		stb[3][5'd0] = 19'b0100100101101000010;
		stb[3][5'd1] = 19'b1000100000100101010;
		stb[3][5'd2] = 19'b0001100101110101010;
		stb[3][5'd3] = 19'b0101100000110001010;
		stb[3][5'd4] = 19'b1011100101100101010;
		stb[3][5'd5] = 19'b0111100000111001010;
		stb[3][5'd6] = 19'b0110100100111101010;
		stb[3][5'd7] = 19'b0011110001101101010;
		stb[3][5'd8] = 19'b1101100111111001010;
		stb[3][5'd9] = 19'b0101011011010101010;
		stb[3][5'd10] = 19'b0111100011000001010;
		stb[3][5'd11] = 19'b1101100101010101010;
		stb[3][5'd12] = 19'b1101011001001001010;
		stb[3][5'd13] = 19'b0111001000001001010;
		stb[3][5'd14] = 19'b0100100111011101010;
		stb[3][5'd15] = 19'b1011100110000000110;
		stb[3][5'd16] = 19'b1111110010110000110;
		stb[3][5'd17] = 19'b1001011000101100110;
		stb[3][5'd18] = 19'b1111011100111100110;
		stb[3][5'd19] = 19'b1000011111101000110;
		stb[3][5'd20] = 19'b1001100101110000110;
		stb[3][5'd21] = 19'b1110001001010010110;
		stb[3][5'd22] = 19'b1100011010001110100;
		stb[3][5'd23] = 19'b1111011010111110100;
		stb[3][5'd24] = 19'b1010001001010011100;
		stb[3][5'd25] = 19'b0100011010100111100;
		stb[3][5'd26] = 19'b1101001000011101100;
		stb[3][5'd27] = 19'b0110011000001101101;
		stb[3][5'd28] = 19'b1011001010110111101;
		stb[3][5'd29] = 19'b0001011001000001101;
		stb[3][5'd30] = 19'b1001011010101011101;
		stb[3][5'd31] = 19'b1101011000111011101;
	end

	always @(posedge clk)
		addr_latched <= addr;

	always @(*) // combinational, so use a blocking assignment
		ph = stb[phaselo_XI_76][clk ? addr : addr_latched]; // addr_latched might be stale on clk edge
endmodule

Update 2023/03/06: fixed inaccurate conversion. Bold lines mark changes from last time.

In jt51_op.v, replace the original jt51_phrom “call” as follows:

jt51_phrom u_phrom(
    .clk    ( clk       ),
    .cen    ( cen       ),
    .addr   ( aux_X[5:1]),
    .phaselo_XI_76 ( phaselo_XI[7:6] ),
    .ph     ( stb    )
);

And jt51_exprom.v, also GPL 3 or later but header removed for brevity.

module jt51_exprom
(
    input [4:0]         addr,
    input               clk,
    input               cen,
    input [1:0]         totalatten_XII_76,
    output reg [9:0]        etf,
    output reg [2:0]        etg
);
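    reg [4:0] addr_latched; // address sampled on the rising clock edge
    reg [9:0] explut_etf[3:0][31:0]; // first index: totalatten_XII[7:6]
    reg [2:0] explut_etg[3:0][31:0];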
    initial
    begin
        explut_etf[0][5'd0] = 10'b1110110111;
        explut_etf[0][5'd1] = 10'b1110101011;
        explut_etf[0][5'd2] = 10'b1110011101;
        explut_etf[0][5'd3] = 10'b1110000101;
        explut_etf[0][5'd4] = 10'b1110100001;
        explut_etf[0][5'd5] = 10'b1110110110;
        explut_etf[0][5'd6] = 10'b1001001010;
        explut_etf[0][5'd7] = 10'b1110011100;
        explut_etf[0][5'd8] = 10'b1110000100;
        explut_etf[0][5'd9] = 10'b1110010000;
        explut_etf[0][5'd10] = 10'b1110110111;
        explut_etf[0][5'd11] = 10'b1110101011;
        explut_etf[0][5'd12] = 10'b1110111101;
        explut_etf[0][5'd13] = 10'b1110100101;
        explut_etf[0][5'd14] = 10'b1110110001;
        explut_etf[0][5'd15] = 10'b1110101110;
        explut_etf[0][5'd16] = 10'b1110111010;
        explut_etf[0][5'd17] = 10'b1110100010;
        explut_etf[0][5'd18] = 10'b1110110100;
        explut_etf[0][5'd19] = 10'b1110101000;
        explut_etf[0][5'd20] = 10'b1110111111;
        explut_etf[0][5'd21] = 10'b1110100111;
        explut_etf[0][5'd22] = 10'b1110110011;
        explut_etf[0][5'd23] = 10'b1110101101;
        explut_etf[0][5'd24] = 10'b1010000101;
        explut_etf[0][5'd25] = 10'b1110010001;
        explut_etf[0][5'd26] = 10'b1110001110;
        explut_etf[0][5'd27] = 10'b1110011010;
        explut_etf[0][5'd28] = 10'b1110100010;
        explut_etf[0][5'd29] = 10'b1110110100;
        explut_etf[0][5'd30] = 10'b1010011000;
        explut_etf[0][5'd31] = 10'b1110000000;
        explut_etf[1][5'd0] = 10'b0010000010;
        explut_etf[1][5'd1] = 10'b1101001100;
        explut_etf[1][5'd2] = 10'b0011000100;
        explut_etf[1][5'd3] = 10'b0011001000;
        explut_etf[1][5'd4] = 10'b0011000000;
        explut_etf[1][5'd5] = 10'b0010101111;
        explut_etf[1][5'd6] = 10'b0010100111;
        explut_etf[1][5'd7] = 10'b0011101011;
        explut_etf[1][5'd8] = 10'b0011100011;
        explut_etf[1][5'd9] = 10'b0010011101;
        explut_etf[1][5'd10] = 10'b0010010101;
        explut_etf[1][5'd11] = 10'b0011011001;
        explut_etf[1][5'd12] = 10'b1100110001;
        explut_etf[1][5'd13] = 10'b0010111110;
        explut_etf[1][5'd14] = 10'b0011110110;
        explut_etf[1][5'd15] = 10'b0010000110;
        explut_etf[1][5'd16] = 10'b1101001010;
        explut_etf[1][5'd17] = 10'b1100100010;
        explut_etf[1][5'd18] = 10'b1101101100;
        explut_etf[1][5'd19] = 10'b1100010100;
        explut_etf[1][5'd20] = 10'b0010011000;
        explut_etf[1][5'd21] = 10'b1100110000;
        explut_etf[1][5'd22] = 10'b1101111111;
        explut_etf[1][5'd23] = 10'b1100001111;
        explut_etf[1][5'd24] = 10'b1101000111;
        explut_etf[1][5'd25] = 10'b1100101011;
        explut_etf[1][5'd26] = 10'b0011100011;
        explut_etf[1][5'd27] = 10'b0010011101;
        explut_etf[1][5'd28] = 10'b1100110101;
        explut_etf[1][5'd29] = 10'b1101111001;
        explut_etf[1][5'd30] = 10'b0010001001;
        explut_etf[1][5'd31] = 10'b1100100001;
        explut_etf[2][5'd0] = 10'b0101000110;
        explut_etf[2][5'd1] = 10'b0100001010;
        explut_etf[2][5'd2] = 10'b0110111100;
        explut_etf[2][5'd3] = 10'b0111010100;
        explut_etf[2][5'd4] = 10'b0110011000;
        explut_etf[2][5'd5] = 10'b0111100000;
        explut_etf[2][5'd6] = 10'b0110101111;
        explut_etf[2][5'd7] = 10'b0111000111;
        explut_etf[2][5'd8] = 10'b0110001011;
        explut_etf[2][5'd9] = 10'b0111111101;
        explut_etf[2][5'd10] = 10'b0101110101;
        explut_etf[2][5'd11] = 10'b0100111001;
        explut_etf[2][5'd12] = 10'b0101010001;
        explut_etf[2][5'd13] = 10'b0110011110;
        explut_etf[2][5'd14] = 10'b0100010110;
        explut_etf[2][5'd15] = 10'b0111101010;
        explut_etf[2][5'd16] = 10'b0101100010;
        explut_etf[2][5'd17] = 10'b0100101100;
        explut_etf[2][5'd18] = 10'b0100100100;
        explut_etf[2][5'd19] = 10'b0111001000;
        explut_etf[2][5'd20] = 10'b0101000000;
        explut_etf[2][5'd21] = 10'b0101001111;
        explut_etf[2][5'd22] = 10'b0010000111;
        explut_etf[2][5'd23] = 10'b0000001011;
        explut_etf[2][5'd24] = 10'b0000000011;
        explut_etf[2][5'd25] = 10'b0000001101;
        explut_etf[2][5'd26] = 10'b0000000101;
        explut_etf[2][5'd27] = 10'b0000001001;
        explut_etf[2][5'd28] = 10'b0000000001;
        explut_etf[2][5'd29] = 10'b0000001110;
        explut_etf[2][5'd30] = 10'b0000000110;
        explut_etf[2][5'd31] = 10'b0000001010;
        explut_etf[3][5'd0] = 10'b0010101011;
        explut_etf[3][5'd1] = 10'b0010010101;
        explut_etf[3][5'd2] = 10'b0010111110;
        explut_etf[3][5'd3] = 10'b0001001010;
        explut_etf[3][5'd4] = 10'b0001100100;
        explut_etf[3][5'd5] = 10'b0010111111;
        explut_etf[3][5'd6] = 10'b0010001011;
        explut_etf[3][5'd7] = 10'b0010100101;
        explut_etf[3][5'd8] = 10'b0010111110;
        explut_etf[3][5'd9] = 10'b0010001010;
        explut_etf[3][5'd10] = 10'b0010010100;
        explut_etf[3][5'd11] = 10'b0010111111;
        explut_etf[3][5'd12] = 10'b0010101011;
        explut_etf[3][5'd13] = 10'b0001010101;
        explut_etf[3][5'd14] = 10'b0010000001;
        explut_etf[3][5'd15] = 10'b0010011010;
        explut_etf[3][5'd16] = 10'b0010001100;
        explut_etf[3][5'd17] = 10'b0010010000;
        explut_etf[3][5'd18] = 10'b0010000111;
        explut_etf[3][5'd19] = 10'b0010011101;
        explut_etf[3][5'd20] = 10'b0010001001;
        explut_etf[3][5'd21] = 10'b0010010110;
        explut_etf[3][5'd22] = 10'b0010000010;
        explut_etf[3][5'd23] = 10'b0010011000;
        explut_etf[3][5'd24] = 10'b0010101111;
        explut_etf[3][5'd25] = 10'b0010110011;
        explut_etf[3][5'd26] = 10'b0010100101;
        explut_etf[3][5'd27] = 10'b0010000001;
        explut_etf[3][5'd28] = 10'b0010011010;
        explut_etf[3][5'd29] = 10'b0010101100;
        explut_etf[3][5'd30] = 10'b0000001000;
        explut_etf[3][5'd31] = 10'b0010010111;
        explut_etg[0][5'd0] = 3'b101;
        explut_etg[0][5'd1] = 3'b101;
        explut_etg[0][5'd2] = 3'b101;
        explut_etg[0][5'd3] = 3'b101;
        explut_etg[0][5'd4] = 3'b101;
        explut_etg[0][5'd5] = 3'b101;
        explut_etg[0][5'd6] = 3'b101;
        explut_etg[0][5'd7] = 3'b101;
        explut_etg[0][5'd8] = 3'b101;
        explut_etg[0][5'd9] = 3'b101;
        explut_etg[0][5'd10] = 3'b110;
        explut_etg[0][5'd11] = 3'b110;
        explut_etg[0][5'd12] = 3'b110;
        explut_etg[0][5'd13] = 3'b110;
        explut_etg[0][5'd14] = 3'b110;
        explut_etg[0][5'd15] = 3'b110;
        explut_etg[0][5'd16] = 3'b110;
        explut_etg[0][5'd17] = 3'b110;
        explut_etg[0][5'd18] = 3'b110;
        explut_etg[0][5'd19] = 3'b110;
        explut_etg[0][5'd20] = 3'b100;
        explut_etg[0][5'd21] = 3'b100;
        explut_etg[0][5'd22] = 3'b100;
        explut_etg[0][5'd23] = 3'b100;
        explut_etg[0][5'd24] = 3'b100;
        explut_etg[0][5'd25] = 3'b100;
        explut_etg[0][5'd26] = 3'b100;
        explut_etg[0][5'd27] = 3'b100;
        explut_etg[0][5'd28] = 3'b100;
        explut_etg[0][5'd29] = 3'b100;
        explut_etg[0][5'd30] = 3'b100;
        explut_etg[0][5'd31] = 3'b100;
        explut_etg[1][5'd0] = 3'b101;
        explut_etg[1][5'd1] = 3'b101;
        explut_etg[1][5'd2] = 3'b101;
        explut_etg[1][5'd3] = 3'b101;
        explut_etg[1][5'd4] = 3'b101;
        explut_etg[1][5'd5] = 3'b100;
        explut_etg[1][5'd6] = 3'b100;
        explut_etg[1][5'd7] = 3'b100;
        explut_etg[1][5'd8] = 3'b100;
        explut_etg[1][5'd9] = 3'b100;
        explut_etg[1][5'd10] = 3'b100;
        explut_etg[1][5'd11] = 3'b100;
        explut_etg[1][5'd12] = 3'b100;
        explut_etg[1][5'd13] = 3'b100;
        explut_etg[1][5'd14] = 3'b100;
        explut_etg[1][5'd15] = 3'b100;
        explut_etg[1][5'd16] = 3'b100;
        explut_etg[1][5'd17] = 3'b100;
        explut_etg[1][5'd18] = 3'b100;
        explut_etg[1][5'd19] = 3'b100;
        explut_etg[1][5'd20] = 3'b100;
        explut_etg[1][5'd21] = 3'b100;
        explut_etg[1][5'd22] = 3'b101;
        explut_etg[1][5'd23] = 3'b101;
        explut_etg[1][5'd24] = 3'b101;
        explut_etg[1][5'd25] = 3'b101;
        explut_etg[1][5'd26] = 3'b101;
        explut_etg[1][5'd27] = 3'b101;
        explut_etg[1][5'd28] = 3'b101;
        explut_etg[1][5'd29] = 3'b101;
        explut_etg[1][5'd30] = 3'b101;
        explut_etg[1][5'd31] = 3'b101;
        explut_etg[2][5'd0] = 3'b101;
        explut_etg[2][5'd1] = 3'b101;
        explut_etg[2][5'd2] = 3'b101;
        explut_etg[2][5'd3] = 3'b101;
        explut_etg[2][5'd4] = 3'b101;
        explut_etg[2][5'd5] = 3'b101;
        explut_etg[2][5'd6] = 3'b001;
        explut_etg[2][5'd7] = 3'b001;
        explut_etg[2][5'd8] = 3'b001;
        explut_etg[2][5'd9] = 3'b001;
        explut_etg[2][5'd10] = 3'b001;
        explut_etg[2][5'd11] = 3'b001;
        explut_etg[2][5'd12] = 3'b001;
        explut_etg[2][5'd13] = 3'b001;
        explut_etg[2][5'd14] = 3'b001;
        explut_etg[2][5'd15] = 3'b001;
        explut_etg[2][5'd16] = 3'b001;
        explut_etg[2][5'd17] = 3'b001;
        explut_etg[2][5'd18] = 3'b001;
        explut_etg[2][5'd19] = 3'b001;
        explut_etg[2][5'd20] = 3'b001;
        explut_etg[2][5'd21] = 3'b110;
        explut_etg[2][5'd22] = 3'b110;
        explut_etg[2][5'd23] = 3'b110;
        explut_etg[2][5'd24] = 3'b110;
        explut_etg[2][5'd25] = 3'b110;
        explut_etg[2][5'd26] = 3'b110;
        explut_etg[2][5'd27] = 3'b110;
        explut_etg[2][5'd28] = 3'b110;
        explut_etg[2][5'd29] = 3'b110;
        explut_etg[2][5'd30] = 3'b110;
        explut_etg[2][5'd31] = 3'b110;
        explut_etg[3][5'd0] = 3'b111;
        explut_etg[3][5'd1] = 3'b111;
        explut_etg[3][5'd2] = 3'b111;
        explut_etg[3][5'd3] = 3'b111;
        explut_etg[3][5'd4] = 3'b111;
        explut_etg[3][5'd5] = 3'b011;
        explut_etg[3][5'd6] = 3'b011;
        explut_etg[3][5'd7] = 3'b011;
        explut_etg[3][5'd8] = 3'b011;
        explut_etg[3][5'd9] = 3'b011;
        explut_etg[3][5'd10] = 3'b011;
        explut_etg[3][5'd11] = 3'b101;
        explut_etg[3][5'd12] = 3'b101;
        explut_etg[3][5'd13] = 3'b101;
        explut_etg[3][5'd14] = 3'b101;
        explut_etg[3][5'd15] = 3'b101;
        explut_etg[3][5'd16] = 3'b101;
        explut_etg[3][5'd17] = 3'b101;
        explut_etg[3][5'd18] = 3'b001;
        explut_etg[3][5'd19] = 3'b001;
        explut_etg[3][5'd20] = 3'b001;
        explut_etg[3][5'd21] = 3'b001;
        explut_etg[3][5'd22] = 3'b001;
        explut_etg[3][5'd23] = 3'b001;
        explut_etg[3][5'd24] = 3'b110;
        explut_etg[3][5'd25] = 3'b110;
        explut_etg[3][5'd26] = 3'b110;
        explut_etg[3][5'd27] = 3'b110;
        explut_etg[3][5'd28] = 3'b110;
        explut_etg[3][5'd29] = 3'b110;
        explut_etg[3][5'd30] = 3'b110;
        explut_etg[3][5'd31] = 3'b010;
    end

//    always @ (posedge clk) if(cen) begin
//        etf <= explut_etf[totalatten_XII_76][addr];
//        etg <= explut_etg[totalatten_XII_76][addr];
//    end
    always @ (posedge clk) // only update addr on clock edge
        addr_latched <= addr;

    always @(*) begin // allow etf and etg updates whenever totalatten_XII_76 changes
        // combinational block, so use blocking assignments
        etf = explut_etf[totalatten_XII_76][clk ? addr : addr_latched]; // addr_latched might be stale on clk edge
        etg = explut_etg[totalatten_XII_76][clk ? addr : addr_latched]; // addr_latched might be stale on clk edge
    end

endmodule

Update 2023/03/06: bold lines. Need to keep always conditions intact.

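And replace the original jt51_exprom call as follows (etf and etg are now driven by the module's outputs, so they have to be wires rather than regs):

wire [ 9:0] etf;
wire [ 2:0] etg;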

jt51_exprom u_exprom(
    .clk    ( clk           ),
    .cen    ( cen           ),
    .addr   ( atten_internal_XI[5:1] ),
    .totalatten_XII_76 ( totalatten_XII[7:6] ),
    .etf    ( etf ),
    .etg    ( etg )
);

And the Perl script to generate the exprom table:

#!/usr/bin/perl

use Data::Dumper;

$exp_XII = [ [ 1, 1, 1, 1, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1 ],
    [ 1, 1, 1, 1, 0, 1, 0, 1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 1 ],
    [ 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 0, 1, 1 ],
    [ 1, 1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 1 ],
    [ 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1 ],
    [ 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 1, 1 ],
    [ 1, 1, 0, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0 ],
    [ 1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 1, 1 ],
    [ 1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1 ],
    [ 1, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1 ],
    [ 1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1 ],
    [ 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1, 0, 1, 0, 1, 0, 1, 1 ],
    [ 1, 0, 1, 1, 1, 0, 1, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1 ],
    [ 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 1 ],
    [ 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 1, 1 ],
    [ 1, 0, 1, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 1, 1, 0, 1, 0, 1, 1 ],
    [ 1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1 ],
    [ 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1 ],
    [ 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 1 ],
    [ 1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1 ],
    [ 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1 ],
    [ 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1 ],
    [ 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 1, 1 ],
    [ 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1 ],
    [ 0, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0 ],
    [ 0, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 1 ],
    [ 0, 1, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1 ],
    [ 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1 ],
    [ 0, 1, 1, 0, 1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 1 ],
    [ 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 1 ],
    [ 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0 ],
    [ 0, 1, 0, 1, 1, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1 ] ];

for $i (0..31) {
    $etf->[0][$i] = [ 1, reverse(@{$exp_XII->[$i]}[36..44]) ];
    $etg->[0][$i] = [ 1, reverse(@{$exp_XII->[$i]}[34..35]) ];

    $etf->[1][$i] = [ reverse(@{$exp_XII->[$i]}[24..33]) ];
    $etg->[1][$i] = [ 1, 0, @{$exp_XII->[$i]}[23] ];

    $etf->[2][$i] = [ 0, reverse(@{$exp_XII->[$i]}[14..22]) ];
    $etg->[2][$i] = [ reverse(@{$exp_XII->[$i]}[11..13]) ];

    $etf->[3][$i] = [ 0, 0, reverse(@{$exp_XII->[$i]}[3..10]) ];
    $etg->[3][$i] = [ reverse(@{$exp_XII->[$i]}[0..2]) ];
}
# original code:
# 2'b00: begin
#         etf = { 1'b1, exp_XII[44:36]  };
#         etg = { 1'b1, exp_XII[35:34] };             
#     end
# 2'b01: begin
#         etf = exp_XII[33:24];
#         etg = { 2'b10, exp_XII[23] };               
#     end
# 2'b10: begin
#         etf = { 1'b0, exp_XII[22:14]  };
#         etg = exp_XII[13:11];               
#     end
# 2'b11: begin
#         etf = { 2'b00, exp_XII[10:3]  };
#         etg = exp_XII[2:0];
#     end

# Print the assignments using the array names from jt51_exprom.v
for $j (0..3) {
    for $i (0..31) {
        print "explut_etf[$j]\[5'd$i] = 10'b";
        for $k (0..9) {
            print $etf->[$j][$i][$k];
        }
        print ";\n";
    }
}
for $j (0..3) {
    for $i (0..31) {
        print "explut_etg[$j]\[5'd$i] = 3'b";
        for $k (0..2) {
            print $etg->[$j][$i][$k];
        }
        print ";\n";
    }
}

The exprom code used [44:36], so we need to reverse that using Perl’s array-reversing function, reverse(). The notation used here (reverse(@{$exp_XII->[$i]}[36..44])) is probably one of the reasons why Perl has fallen out of favor. :)
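Because it is easy to get the direction wrong here, this is a small Python sketch (not from the original scripts; both function names are made up) of the two ways a Verilog part-select like [44:36] can map onto an array of bits. Which one applies depends on how the array was filled: if element 0 holds bit 0, the reverse-a-slice approach is right; if the array was typed in the same order as a Verilog literal (MSB first), the index has to be mirrored.

```python
def part_select_msb_first(bits, hi, lo):
    """bits is written like a Verilog literal: bits[0] is the MSB.

    Bit number i therefore lives at list index len(bits) - 1 - i.
    """
    width = len(bits)
    return [bits[width - 1 - i] for i in range(hi, lo - 1, -1)]


def part_select_index_is_bit(bits, hi, lo):
    """bits[i] holds bit number i.

    This is the Python spelling of Perl's reverse(@bits[lo..hi]).
    """
    return list(reversed(bits[lo:hi + 1]))
```

With literal = [1, 0, 1, 1, 0] (that is, 5'b10110), part_select_msb_first(literal, 4, 3) gives [1, 0], and part_select_index_is_bit gives the same answer only when it is fed the reversed list.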

Two Raspberry Pi Picos pretending to be a Z80

Last year, I bought a faulty Hitachi MB-H2 (MSX) in order to gain electronics and repair experience. Using my oscilloscope and two simple 74-series (NOT and AND) logic ICs, I managed to figure out that one of the RAM chips was faulty. I replaced the RAM chip, but it still wouldn’t work. I did one more slightly less reliable oscilloscope-based test and replaced one more chip, and it still wouldn’t work. How many faults can this machine have? Well, it turns out that probably only the first RAM chip was broken in the first place, and I just didn’t solder properly. I thought I had checked my connections, but I guess one was borderline. (I have more soldering experience now.)

So, suspecting that I had some kind of severe fault, and not having come up with the logic analyzer “idea” yet, and noticing that the CPU was socketed, I decided to take out the CPU and just generate the signals that the CPU would generate myself, using two Raspberry Pi Picos. (Because I needed a lot of pins, not necessarily performance.) One Pico is responsible for the address and control pins, the other for the data pins. As I noticed some time in, Pico 1 should have had the data and control pins, Pico 2 the address pins. Why? Timing matters with the data pins, but for address pins you can be super slow and it’ll be fine. It still worked out in the end.

Pico 1 controls Pico 2 via UART. A host computer (yes, you, Mr. ThinkPad) controls Pico 1 via serial. Then some idiot (yes, me) types commands into a serial terminal, and Pico 1 does the idiot’s bidding. The following commands are recognized:

  • i, for IORQ input
  • o, for IORQ output
  • v, for VRAM manipulation, which I actually couldn’t get to work the way I expected, but it still does something
  • r, for RAM reads
  • w, for RAM writes (and a simple readback to make sure the RAM stores stuff)
  • W, for RAM writes with RAM refresh (and a readback after every refresh). You can specify the number of writes between refreshes and stuff.
  • s, to sync the UART (flushes out all characters stuck in the UART read buffer)
  • 0, to ask Pico 2 to set the data bus to 0
  • u, to ask Pico 2 to unset the data bus (i.e., to change the bus direction from GPIO_OUT to GPIO_IN)
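For illustration, here is a hedged Python sketch of how such a command string could be split up on the host side. This is not the firmware's parser. The syntax is only inferred from the example strings in this post: an output looks like o followed by four hex digits of address and two hex digits of data (e.g. o00ab82), and a read looks like r, four hex digits, r (e.g. r0000r). The other commands are left out because their argument format isn't shown here.

```python
import re

# Inferred shapes: o<addr:4 hex><data:2 hex> and r<addr:4 hex>r
COMMAND = re.compile(r"o([0-9a-f]{4})([0-9a-f]{2})|r([0-9a-f]{4})r")


def parse_commands(text):
    """Split a pasted command string into (kind, address, data) tuples."""
    commands = []
    pos = 0
    while pos < len(text):
        match = COMMAND.match(text, pos)
        if match is None:
            raise ValueError(f"unrecognized command at offset {pos}")
        if match.group(1) is not None:
            commands.append(("o", int(match.group(1), 16), int(match.group(2), 16)))
        else:
            commands.append(("r", int(match.group(3), 16), None))
        pos = match.end()
    return commands
```

parse_commands("o00ab82o00aa50") would yield two output commands, one writing 0x82 to port 0xab and one writing 0x50 to port 0xaa.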

So, how does it work? How does the Z80 work? Let’s have a look at the Z80 pinout:

The A pins are the address pins. The D pins are the data pins. So if you want to write 0 to address 0, all those pins will be 0. In addition, RD will be high, and WR will be low, because we are writing. (Yes, 0 means “active” and 1 means “inactive”.) We are also writing to memory, not to IO space, so IORQ is 1 and MREQ is 0. (M1 also goes from 1 to 0 during opcode fetches, but I don’t remember the details there.) If we instead want to talk to hardware, we need to know the hardware address and set IORQ to 0 instead of MREQ. On the Z80, only A0 to A7 matter for IO addresses. Well, that’s the gist of it.
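The pin levels described above can be written down as a tiny Python model. This is only a sketch of the steady-state levels, not of the real Z80 timing (no clock, no wait states, no M1 or refresh), and the function name bus_pins and its arguments are made up for illustration.

```python
def bus_pins(space, direction, address, data=0):
    """Return the pin levels for one bus access as a dict.

    space is "mem" or "io", direction is "read" or "write".
    Control pins are active low: 0 means active, 1 means inactive.
    """
    if space not in ("mem", "io") or direction not in ("read", "write"):
        raise ValueError("space must be mem/io, direction must be read/write")
    pins = {"MREQ": 1, "IORQ": 1, "RD": 1, "WR": 1}
    pins["MREQ" if space == "mem" else "IORQ"] = 0
    pins["RD" if direction == "read" else "WR"] = 0
    for bit in range(16):
        pins[f"A{bit}"] = (address >> bit) & 1
    if direction == "write":
        # Only drive the data pins when writing; on reads they are inputs.
        for bit in range(8):
            pins[f"D{bit}"] = (data >> bit) & 1
    return pins
```

bus_pins("mem", "write", 0, 0) reproduces the example from the text: every A and D pin is 0, MREQ and WR are 0, IORQ and RD are 1.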

With a crude thing like this, we can:

  • Dump the main ROM and check that the contents are correct
  • Check if memory works
  • Check if the sound chip works
  • Check if the video chip works
  • Check if the IO controller works
    • We can turn the tape motor on and off
    • We can map memory
    • Etc.?
  • (Provided the connection from CPU to the above peripherals is working)

I was able to check all of the above. In the highly unlikely event that you decide to run any of this on your MSX machine, note that memory mapping is a bit different from machine to machine. (Which is important, otherwise RAM expansions wouldn’t work, right?)
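
One way to check the connections between the CPU socket and the rest of the machine is to read addresses that have exactly one address line high, so that a stuck or swapped line shows up as a wrong byte. The sketch below builds such a string of r commands. Going by the code further down, the r command waits for one more character before it ends the read cycle, which is why each command ends in a second r. This is plain C for the host, and build_single_bit_reads and dump_line_for_address are hypothetical helper names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

#define ADDRESS_LINES 16
#define CHARS_PER_READ 6 /* 'r' + four hex digits + 'r' */

/* Fills out with one read command per address line (A0 first) and
 * NUL-terminates it. Returns the string length, or 0 if out is too small. */
size_t build_single_bit_reads(char *out, size_t out_size) {
    assert(out != NULL);
    if (out_size < ADDRESS_LINES * CHARS_PER_READ + 1) {
        return 0;
    }
    size_t written = 0;
    for (int line = 0; line < ADDRESS_LINES; line++) {
        char command[16];
        unsigned int address = 1u << line;
        snprintf(command, sizeof command, "r%04xr", address);
        memcpy(out + written, command, CHARS_PER_READ);
        written += CHARS_PER_READ;
    }
    out[written] = '\0';
    return written;
}

/* 1-based line number of an address in a hex dump with one byte per line,
 * which is the layout assumed here for comparing against a known-good dump. */
unsigned long dump_line_for_address(unsigned int address) {
    return (unsigned long)address + 1;
}
```

With a dump that has one byte per line, address 4000 is on line 16385, so the expected bytes can be pulled out of a known-good ROM dump by line number.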

So here are some examples of commands I’d paste into my terminal:

# Turn off tape motor (which is on by default IIRC?), map memory, maybe some other stuff (I got these by running the MB-H2 in openmsx and checking the earliest 'in's and 'out's in openmsx-debugger, also see below screenshot)
# Execute this before executing anything else!
o00ab82o00aa50o00a800o00a850o00a8a0o00a8f0

# Read first 16 bytes of ROM
r0000rr0001rr0002rr0003rr0004rr0005rr0006rr0007rr0008rr0009rr000arr000brr000crr000drr000err000frr0010r

# Read 16 addresses, each with exactly one address bit set: bit 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, and 15 (counting bits from 0) (if you get the right result here you'll know that your connections are good)
r0001rr0002rr0004rr0008rr0010rr0020rr0040rr0080rr0100rr0200rr0400rr0800rr1000rr2000rr4000rr8000r

# Expected result of above command for MB-H2:
sed -n -e '2p' -e '3p' -e '5p' -e '9p' -e '17p' -e '33p' -e '65p' -e '129p' -e '257p' -e '513p' -e '1025p' -e '2049p' -e '4097p' -e '8193p' -e '16385p' 32k_v2.hex
c3
d7
bf
c3
c3
c3
13
06
ee
2a
e5
32
a4
00
e5
head -n 1 16k_v2.hex
41

# Set background colors:
o00990fo009987 # white background
o00990eo009987 # gray
o00990do009987 # magenta
o00990co009987 # dark green
o00990bo009987 # light yellow
o00990ao009987 # dark yellow
o009909o009987 # light red
o009908o009987 # medium red
o009907o009987 # cyan
o009906o009987 # dark red
o009905o009987 # light blue
o009904o009987 # dark blue
o009903o009987 # light green
o009902o009987 # medium green
o009901o009987 # black

# click sound test:
o00ab0fo00ab0eo00ab0fo00ab0eo00ab0fo00ab0e

# VRAM notes (worked partially, but I think you may need to change the video mode to something else in order to get full VRAM access like this? Never really got it to work as expected. IIRC, the values would stick for a bit, and then go back to 0f or something)
# To read from 0000 to ... (address is auto-incremented, so you only have to set it once):
o009900 # set lower byte of address
o009900 # set upper byte of address (bit 7 and 6 are low to indicate that we want to read)
i0098 # read from 0000
i0098 # read from 0001
i0098 # read from 0002
i0098 # read from 0003
...
# bunched into a single line:
o009900o009900i0098i0098i0098i0098
# To write ff to 0000-... (address is auto-incremented, so you only have to set it once):
o009900 # set lower byte of address
o009940 # set upper byte of address (bit 7 is low and bit 6 is high to indicate that we want to write)
o0098ff # set data register to ff to write to 0000
o0098ff # set data register to ff to write to 0001
o0098ff # set data register to ff to write to 0002
o0098ff # set data register to ff to write to 0003
...
# bunched into a single line:
o009900o009940o0098ffo0098ffo0098ff
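
The video chip commands above all follow the same two-byte pattern on port 99. A register write sends the value first and then 80 plus the register number (87 for register 7, which holds the background color). A VRAM address is sent low byte first, with bit 6 of the high byte set for writes. Here is a minimal sketch that builds these byte pairs and formats them as o commands. It is plain C for the host and the helper names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

#define VDP_CONTROL_PORT 0x99u /* port used for register and address setup above */

/* Two control-port bytes that write value into video register reg (0..7). */
void vdp_register_write(uint8_t reg, uint8_t value, uint8_t out[2]) {
    assert(reg < 8);
    out[0] = value;
    out[1] = (uint8_t)(0x80 | reg); /* bit 7 high marks a register write */
}

/* Two control-port bytes that set the VRAM address for reading or writing. */
void vdp_vram_address(uint16_t address, bool is_write, uint8_t out[2]) {
    out[0] = (uint8_t)(address & 0xff);
    out[1] = (uint8_t)((address >> 8) & 0x3f); /* bits 7 and 6 low: read */
    if (is_write) {
        out[1] |= 0x40;                        /* bit 6 high: write */
    }
}

/* Formats both bytes as two 'o' commands, e.g. "o00990fo009987".
 * out must have room for 14 characters plus the terminating NUL. */
void format_control_commands(const uint8_t bytes[2], char *out, size_t out_size) {
    assert(out != NULL && out_size >= 15);
    snprintf(out, out_size, "o%04x%02xo%04x%02x",
             VDP_CONTROL_PORT, (unsigned int)bytes[0],
             VDP_CONTROL_PORT, (unsigned int)bytes[1]);
}
```

Calling vdp_register_write(7, 0x0f, bytes) followed by format_control_commands reproduces the white background command, and vdp_vram_address(0, true, bytes) gives the 00 and 40 pair used before the VRAM writes.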

What’s a good story without pictures?

openmsx-debugger early in the boot process (looking for RAM)
Serial terminal for Pico 1 (left) and Pico 2 (right) (Pico 2’s output is just debug output, and doesn’t accept commands from this serial connection)
I put a 40-pin socket in the existing 40-pin socket, and directly clipped in breadboard jumper wire. It worked, but wasn’t fun. I still stuck with this setup. Note: maybe half the wires pictured here are actually part of the computer. Yes, this computer has many, many wires inside.
BTW, just one Pico here because I thought it’d be worth something even with just a couple of bits of the data bus connected. But yeah, that wasn’t very fun.
Now running with two Raspberry Pi Picos. At first I tried using an automatic level shifter, but that didn’t work. Possibly because the data bus’ idle voltage (when everything is at high impedance) is at around 2V. I think I even saw a datasheet somewhere recommending doing that, maybe the 4164 RAM?
And here’s a tiny minicom script cycling through the background colors (see below)
sleep 5
send o009900o009980\c
send o009903o009981\c
send o009900o009987\c
sleep 1
send o009901o009987\c
sleep 1
send o009902o009987\c
sleep 1
send o009903o009987\c
sleep 1
send o009904o009987\c
sleep 1
send o009905o009987\c
sleep 1
send o009906o009987\c
sleep 1
send o009907o009987\c
sleep 1
send o009908o009987\c
sleep 1
send o009909o009987\c
sleep 1
send o00990ao009987\c
sleep 1
send o00990bo009987\c
sleep 1
send o00990co009987\c
sleep 1
send o00990do009987\c
sleep 1
send o00990eo009987\c
sleep 1
send o00990fo009987\c

The code

Danger: do not submit code to code beauty contests.

The code consists of two separate projects, in CMake terms. One is for Pico 1, the other is for Pico 2. First of all the CMakeLists.txt files are as follows:

Pico 1 (create a directory called e.g. inspect_system_interactively and in there, create a file called CMakeLists.txt with the following contents):

cmake_minimum_required(VERSION 3.12)

# Pull in SDK (must be before project)
include(pico_sdk_import.cmake)

project(pico_examples C CXX ASM)
set(CMAKE_C_STANDARD 11)
set(CMAKE_CXX_STANDARD 17)

if (PICO_SDK_VERSION_STRING VERSION_LESS "1.3.0")
    message(FATAL_ERROR "Raspberry Pi Pico SDK version 1.3.0 (or later) required. Your version is ${PICO_SDK_VERSION_STRING}")
endif()

set(PICO_EXAMPLES_PATH ${PROJECT_SOURCE_DIR})

# Initialize the SDK
pico_sdk_init()

include(example_auto_set_url.cmake)

add_compile_options(-Wall -Wextra
        -Wno-format          # int != int32_t as far as the compiler is concerned because gcc has int32_t as long int
        -Wno-unused-function # we have some for the docs that aren't called
        -Wno-maybe-uninitialized
        -O3
        )

add_executable(inspect_system_interactively
        inspect_system_interactively.c
        )

# pull in common dependencies
target_link_libraries(inspect_system_interactively pico_stdlib)

# enable usb output, disable uart output
pico_enable_stdio_usb(inspect_system_interactively 1)
pico_enable_stdio_uart(inspect_system_interactively 0)

# create map/bin/hex file etc.
pico_add_extra_outputs(inspect_system_interactively)

# add url via pico_set_program_url
example_auto_set_url(inspect_system_interactively)

Pico 2 (my directory name is inspect_system_interactively_databus):

cmake_minimum_required(VERSION 3.12)

# Pull in SDK (must be before project)
include(pico_sdk_import.cmake)

project(pico_examples C CXX ASM)
set(CMAKE_C_STANDARD 11)
set(CMAKE_CXX_STANDARD 17)

if (PICO_SDK_VERSION_STRING VERSION_LESS "1.3.0")
    message(FATAL_ERROR "Raspberry Pi Pico SDK version 1.3.0 (or later) required. Your version is ${PICO_SDK_VERSION_STRING}")
endif()

set(PICO_EXAMPLES_PATH ${PROJECT_SOURCE_DIR})

# Initialize the SDK
pico_sdk_init()

include(example_auto_set_url.cmake)

add_compile_options(-Wall -Wextra
        -Wno-format          # int != int32_t as far as the compiler is concerned because gcc has int32_t as long int
        -Wno-unused-function # we have some for the docs that aren't called
        -Wno-maybe-uninitialized
        -O3
        )

add_executable(inspect_system_interactively_databus
        inspect_system_interactively_databus.c
        )

# pull in common dependencies
target_link_libraries(inspect_system_interactively_databus pico_stdlib)

# enable usb output, disable uart output
pico_enable_stdio_usb(inspect_system_interactively_databus 1)
pico_enable_stdio_uart(inspect_system_interactively_databus 0)

# create map/bin/hex file etc.
pico_add_extra_outputs(inspect_system_interactively_databus)

# add url via pico_set_program_url
example_auto_set_url(inspect_system_interactively_databus)

You can get the referenced pico_sdk_import.cmake and example_auto_set_url.cmake files from the pico-examples repository (https://github.com/raspberrypi/pico-examples.git).

Next, you need to place the C files into the corresponding directories. Then you just need to execute two commands, “cmake .” followed by “make”, in both directories.

inspect_system_interactively/inspect_system_interactively.c:

#include <stdio.h>
#include "pico/stdlib.h"

#define A11          0
#define A12          1
#define A13          2
#define A14          3
#define A15          4

#define A10         28
#define A9          27
#define A8          26
#define A7          22
#define A6          21
#define A5          20
#define A4          19
#define A3          18
#define A2          17
#define A1          15
#define A0          14

#define MREQ        16 /* active-low */
#define M1          13 /* active-low, we should probably set this and un-set when we're done reading */
#define RFSH        12 /* active-low and we won't be refreshing at all; tie to +5; could also feed inverted MREQ_RD if things don't work otherwise <-- chose to do this */
#define RD           5 /* active-low */
#define WR           6 /* active-low */
#define IOREQ        7 /* active-low */
/* #define HALT        active-low but shouldn't be a problem to keep this floating */

#define UART1_TX     8
#define UART1_RX     9
// #define BAUD    345600
#define BAUD    460800

#define READ_BACK_SIGNAL 10
#define READ_BACK_SIGNAL_ACK 11

#define HIGH         1
#define LOW          0

#ifndef PICO_DEFAULT_LED_PIN
#error blink requires a board with a regular LED
#endif

#define STATUS_LED PICO_DEFAULT_LED_PIN
// #define STATUS_LED 15

#define BUS_SIZE 16

// #define SLOW_PERF 1

const unsigned int a_bus[BUS_SIZE] = {
  A0, A1, A2, A3, A4, A5, A6, A7, A8, A9, A10, A11, A12, A13, A14, A15
};

#define println printf
// #define printf(...) ;

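/* static: a plain C11 "inline" definition can fail to link if the compiler decides not to inline it */
static inline void nop(void) {
    __asm__ __volatile__("nop\t\n");
}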

void setup() {
    gpio_init(A0);
    gpio_init(A1);
    gpio_init(A2);
    gpio_init(A3);
    gpio_init(A4);
    gpio_init(A5);
    gpio_init(A6);
    gpio_init(A7);
    gpio_init(A8);
    gpio_init(A9);
    gpio_init(A10);
    gpio_init(A11);
    gpio_init(A12);
    gpio_init(A13);
    gpio_init(A14);
    gpio_init(A15);
    gpio_init(MREQ);
    gpio_init(IOREQ);
    gpio_init(RD);
    gpio_init(WR);
    gpio_init(M1);
    gpio_init(RFSH);
    gpio_init(STATUS_LED);
    gpio_init(READ_BACK_SIGNAL);
    gpio_init(READ_BACK_SIGNAL_ACK);

    /* Not 100% sure if the default is LOW for all pins so let's set this early and once more after gpio_set_dir to be 100% sure */
    gpio_put(A0, LOW);
    gpio_put(A1, LOW);
    gpio_put(A2, LOW);
    gpio_put(A3, LOW);
    gpio_put(A4, LOW);
    gpio_put(A5, LOW);
    gpio_put(A6, LOW);
    gpio_put(A7, LOW);
    gpio_put(A8, LOW);
    gpio_put(A9, LOW);
    gpio_put(A10, LOW);
    gpio_put(A11, LOW);
    gpio_put(A12, LOW);
    gpio_put(A13, LOW);
    gpio_put(A14, LOW);
    gpio_put(A15, LOW);
    gpio_put(MREQ, HIGH);
    gpio_put(IOREQ, HIGH);
    gpio_put(RD, HIGH);
    gpio_put(WR, HIGH);
    gpio_put(M1, HIGH);
    gpio_put(RFSH, HIGH);
    gpio_put(READ_BACK_SIGNAL, LOW);

    gpio_set_dir(A0, GPIO_OUT);
    gpio_set_dir(A1, GPIO_OUT);
    gpio_set_dir(A2, GPIO_OUT);
    gpio_set_dir(A3, GPIO_OUT);
    gpio_set_dir(A4, GPIO_OUT);
    gpio_set_dir(A5, GPIO_OUT);
    gpio_set_dir(A6, GPIO_OUT);
    gpio_set_dir(A7, GPIO_OUT);
    gpio_set_dir(A8, GPIO_OUT);
    gpio_set_dir(A9, GPIO_OUT);
    gpio_set_dir(A10, GPIO_OUT);
    gpio_set_dir(A11, GPIO_OUT);
    gpio_set_dir(A12, GPIO_OUT);
    gpio_set_dir(A13, GPIO_OUT);
    gpio_set_dir(A14, GPIO_OUT);
    gpio_set_dir(A15, GPIO_OUT);
    gpio_set_dir(MREQ, GPIO_OUT);
    gpio_set_dir(IOREQ, GPIO_OUT);
    gpio_set_dir(RD, GPIO_OUT);
    gpio_set_dir(WR, GPIO_OUT);
    gpio_set_dir(M1, GPIO_OUT);
    gpio_set_dir(RFSH, GPIO_OUT);
    gpio_set_dir(STATUS_LED, GPIO_OUT);
    gpio_set_dir(READ_BACK_SIGNAL, GPIO_OUT);
    gpio_set_dir(READ_BACK_SIGNAL_ACK, GPIO_IN);

    gpio_put(A0, LOW);
    gpio_put(A1, LOW);
    gpio_put(A2, LOW);
    gpio_put(A3, LOW);
    gpio_put(A4, LOW);
    gpio_put(A5, LOW);
    gpio_put(A6, LOW);
    gpio_put(A7, LOW);
    gpio_put(A8, LOW);
    gpio_put(A9, LOW);
    gpio_put(A10, LOW);
    gpio_put(A11, LOW);
    gpio_put(A12, LOW);
    gpio_put(A13, LOW);
    gpio_put(A14, LOW);
    gpio_put(A15, LOW);
    gpio_put(MREQ, HIGH);
    gpio_put(IOREQ, HIGH);
    gpio_put(RD, HIGH);
    gpio_put(WR, HIGH);
    gpio_put(M1, HIGH);
    gpio_put(RFSH, HIGH);
    gpio_put(READ_BACK_SIGNAL, LOW);

    gpio_set_function(UART1_TX, GPIO_FUNC_UART);
    gpio_set_function(UART1_RX, GPIO_FUNC_UART);
}

void set_bus(uint16_t address) {
    int i;
    /* Write lowest bit into lowest address line first, then next-lowest bit, etc. */
    for (i = 0; i < BUS_SIZE; i++) {
        gpio_put(a_bus[i], address & 1);
        address >>= 1;
    }
}

bool uart_read_data_bus(bool data_bus_is_already_set, uint32_t *recv_data) {
    char buf[3] = { 0 };

    if (!data_bus_is_already_set) {
#ifdef SLOW_PERF 
        gpio_put(STATUS_LED, HIGH);
#endif
        uart_putc_raw(uart1, 'r');
    }
    uart_read_blocking(uart1, (uint8_t*)buf, sizeof(buf)-1);
    if (!data_bus_is_already_set) {
#ifdef SLOW_PERF
        while (uart_is_readable(uart1)) {
            printf("Read one more unexpected byte: %02x\n", uart_getc(uart1));
        }
        gpio_put(STATUS_LED, LOW);
#endif
    }
//     printf("Read something: %x %x\n", buf[0], buf[1]);
    if (sscanf((char*)buf, "%02x", recv_data) != 1) { // sscanf can also return EOF (-1), which "!" would treat as success
//         printf("Communication error?\n");
        return false;
    }
#ifdef SLOW_PERF
    else {
        printf("Read %02x / %03d\n", *recv_data, *recv_data);
    }
#endif
    return true;
}

void uart_write_data_bus(bool io, uint8_t send_data) {
    char buf[3] = { 0 };
    sprintf(buf, "%02x", send_data);
    gpio_put(STATUS_LED, HIGH);
    uart_putc_raw(uart1, io ? 'o' : 'w');
    uart_puts(uart1, buf);
    gpio_put(STATUS_LED, LOW);
}

uint32_t get_address_from_stdin() {
    char stdin_buf[5] = { 0 };
    uint32_t address = 0;
    stdin_buf[0] = getchar();
    stdin_buf[1] = getchar();
    stdin_buf[2] = getchar();
    stdin_buf[3] = getchar();
    sscanf(stdin_buf, "%04x", &address);
    printf("\n");
    return address;
}

uint32_t get_data_from_stdin() {
    char stdin_buf[5] = { 0 };
    uint32_t data = 0;
    printf("Data to write? (Two digit hex)\n");
    stdin_buf[0] = getchar();
    stdin_buf[1] = getchar();
    stdin_buf[2] = 0;
    sscanf(stdin_buf, "%02x", &data);
    printf("\n");
    return data;
}

void refresh_dram() {
    uint16_t row_address, temp;
    int i;
    set_bus(0);
    for (row_address = 0; row_address < 256; row_address++) {
        temp = row_address;
        for (i = 0; i < 8; i++) {
            gpio_put(a_bus[i], temp & 1);
            temp >>= 1;
        }
        gpio_put(RFSH, LOW);
        gpio_put(MREQ, LOW);
        // Need to wait at least 150 ns. 20 NOPs are 20 * (1000/133) ~= 150 ns at 133 MHz, or 20 * 8 = 160 ns at the SDK's default 125 MHz. In reality this probably works out to be more than that
        nop();
        nop();
        nop();
        nop();
        nop();

        nop();
        nop();
        nop();
        nop();
        nop();

        nop();
        nop();
        nop();
        nop();
        nop();

        nop();
        nop();
        nop();
        nop();
        nop();
        gpio_put(MREQ, HIGH);
        gpio_put(RFSH, HIGH);
        // Need to delay at least 100 ns, which we'll probably already be doing by setting up the bus with the next address, but let's add some NOPs anyway and then measure
        nop();
        nop();
        nop();
        nop();
        nop();

        nop();
        nop();
        nop();
        nop();
        nop();
    }
}

void write_vram_address(bool is_write, uint8_t high_byte, uint8_t low_byte) {
    if (is_write) {
        high_byte |= 0x40;
    }
    set_bus(0x99);

    uart_write_data_bus(true, low_byte);
    uart_getc(uart1); // Slave takes time to actually set data bus, let's wait for an 'f' char to indicate this op is done
    gpio_put(IOREQ, LOW);
    gpio_put(WR, LOW);
    sleep_ms(1);
    gpio_put(WR, HIGH);
    gpio_put(IOREQ, HIGH);

    uart_write_data_bus(true, high_byte);
    uart_getc(uart1); // Slave takes time to actually set data bus, let's wait for an 'f' char to indicate this op is done
    gpio_put(IOREQ, LOW);
    gpio_put(WR, LOW);
    sleep_ms(1);
    gpio_put(WR, HIGH);
    gpio_put(IOREQ, HIGH);
}

int main() {
//     char stdin_buf[5] = { 0 };
    uint32_t address = 0, end_address = 0;
    int i = 0;
    uint32_t j = 0, k = 0;
    char command = 0;
    uint32_t data = 0;
    uint32_t read_val = 0;
    char data_bus_ready = 0;
    uint32_t n_refresh_cycles = 0, n_writes_until_refresh = 0, only_write_first_time = 0;
    uint8_t high_byte = 0, low_byte = 0;
    int n_vram_writes = 0;
//     bool status = 0;

    stdio_init_all();
    setup();
    uart_init(uart1, BAUD);
    
    /* play with the LED so we can see that the program is about to start */
    for (i = 0; i < 5; i++) {
        gpio_put(STATUS_LED, HIGH);
        sleep_ms(500);
        gpio_put(STATUS_LED, LOW);
        sleep_ms(500);
    }
    sleep_ms(4000);

    printf("Ready\n");

    while (true) {
        read_val = 0;
    
        printf("Command?\n");
        command = getchar();
        printf("\nAddress? (Four digit hex)\n");
        address = get_address_from_stdin();
        switch (command) {
            case 'i':
                printf("Sending IOREQ (input)\n");
                set_bus(address);
                gpio_put(IOREQ, LOW);
                gpio_put(RD, LOW);
                if (!uart_read_data_bus(false, &read_val)) {
                    printf("RECV ERR\n");
                }
                gpio_put(RD, HIGH);
                gpio_put(IOREQ, HIGH);
                printf("Read from IO: %04x %02x\n", address, read_val);
                break;
            case 'o':
                printf("Sending IOREQ (output)\n");
                data = get_data_from_stdin();
                set_bus(address);
                uart_write_data_bus(true, data);
                uart_getc(uart1); // Slave takes time to actually set data bus, let's wait for an 'f' char to indicate this op is done
                gpio_put(IOREQ, LOW);
                gpio_put(WR, LOW);
                sleep_ms(1);
                gpio_put(WR, HIGH);
                gpio_put(IOREQ, HIGH);
                printf("Wrote to IO: %04x %1x\n", address, data);
                break;
            case 'v':
                low_byte = address & 0xff;
                high_byte = (address & 0x3f00) >> 8;
                printf("Number of writes? (Four digit hex)\n");
                n_vram_writes = get_address_from_stdin();
                data = get_data_from_stdin();

                write_vram_address(true, high_byte, low_byte);
                set_bus(0x98);
                uart_write_data_bus(true, data);
                uart_getc(uart1); // Slave takes time to actually set data bus, let's wait for an 'f' char to indicate this op is done
                for (i = 0; i < n_vram_writes; i++) {
                    gpio_put(IOREQ, LOW);
                    gpio_put(WR, LOW);
                    sleep_ms(1); // use VRAM_IO_DELAY macro?
                    gpio_put(WR, HIGH);
                    gpio_put(IOREQ, HIGH);
                }

                write_vram_address(false, high_byte, low_byte);
                set_bus(0x98);
                for (i = 0; i < n_vram_writes; i++) {
                    gpio_put(IOREQ, LOW);
                    gpio_put(RD, LOW);
                    uart_read_data_bus(false, &read_val);
                    gpio_put(RD, HIGH);
                    gpio_put(IOREQ, HIGH);
                    if (read_val != data) {
                        printf("Mismatch at VRAM address %04x: read %02x expected %02x\n", address+i, read_val, data);
                    }
                }
                break;
            case 'r':
                set_bus(address);
                gpio_put(MREQ, LOW);
                gpio_put(RD, LOW);
                gpio_put(M1, LOW);
//                 sleep_ms(1);
                if (!uart_read_data_bus(false, &read_val)) {
                    printf("RECV ERR\n");
//                     return 1;
                }

                printf("%04x %1x\n", address, read_val);
                printf("Press any key to advance cycle\n");
                getchar();
                printf("\n");
                sleep_ms(1);
                gpio_put(M1, HIGH);
                gpio_put(RD, HIGH);
                gpio_put(MREQ, HIGH);
                gpio_put(RFSH, LOW);
                sleep_ms(1);
                gpio_put(RFSH, HIGH);
                break;
            case 'w':
            case 'W':
                printf("End address? (Four digit hex, 0000 for 1-byte write)\n");
                end_address = get_address_from_stdin();
                if (end_address < address) {
                    end_address = address;
                }
                data = get_data_from_stdin();
                if (command == 'W') {
                    printf("Number of refresh cycles? (Two digit hex)\n");
                    n_refresh_cycles = get_data_from_stdin(); // reads two hex digits; use get_address_from_stdin to actually get four digits
                    printf("Refresh every n writes? (Two digit hex)\n");
                    n_writes_until_refresh = get_data_from_stdin();
                    if (n_writes_until_refresh == 0) {
                        n_writes_until_refresh = 1; // avoid a modulo by zero further down
                    }
                    printf("Only write first time? (00: no, 01: yes)\n");
                    only_write_first_time = get_data_from_stdin();
                } else {
                    n_refresh_cycles = 1;
                }
                for (k = 0; k < n_refresh_cycles; k++) {
                    for (j = address; j <= end_address; j++) {
                        set_bus(j);
                        if (k == 0 || !only_write_first_time) {
                            uart_write_data_bus(false, data);
                            data_bus_ready = uart_getc(uart1);
                            if (data_bus_ready == 'f') { // Slave takes time to actually set data bus, let's wait for an 'f' char to indicate this op is done
                                printf("Got 'f'\n");
                            } else {
                                printf("Warning: got '%c' (%02x) instead of 'f'\n", data_bus_ready, data_bus_ready);
                            }
                            gpio_put(WR, LOW);
                            gpio_put(MREQ, LOW);
#define SLEEP_US_DURING_WRITE 3 // was 1
                            sleep_us(SLEEP_US_DURING_WRITE); // should be ~219 ns without this. that's okay generally speaking.
                            gpio_put(WR, HIGH);
                            gpio_put(MREQ, HIGH);
                        }

                        // S command has to take longer than RAM cycle time. should be okay.
                        uart_putc_raw(uart1, 'S'); // force data bus to gpio_in, and stand by for read
                        data_bus_ready = uart_getc(uart1); // gpio_in is done, slave is standing by

                        gpio_put(RD, LOW);
                        gpio_put(MREQ, LOW);
                        // perhaps our timing is just a bit tight? let's add a small delay
                        // much better but still not that great, let's do 1 us
#define SLEEP_US_DURING_READ_BACK 1 // was 1
                        sleep_us(SLEEP_US_DURING_READ_BACK);
                        gpio_put(READ_BACK_SIGNAL, HIGH);
                        while (!gpio_get(READ_BACK_SIGNAL_ACK)) {}

                        gpio_put(READ_BACK_SIGNAL, LOW);
                        gpio_put(RD, HIGH);
                        gpio_put(MREQ, HIGH);

                        if (!uart_read_data_bus(true, &read_val)) {
                            printf("RECV ERR\n");
                        }
                        printf("Wrote %02x to %04x and read back: %02x\n", data, j, read_val);
                        if (command == 'W') {
                            if ((j-address) % n_writes_until_refresh == 0) {
                                refresh_dram();
                            }
                        }
                    }
                }
                break;
            case 's': // sync uart
                while (uart_is_readable(uart1)) {
                    printf("%c", uart_getc(uart1));
                }
                printf("\n");
                break;
            case '0':
                uart_putc_raw(uart1, '0');
                break;
            case 'u':
                uart_putc_raw(uart1, 'u');
                break;
            default:
                printf("Unknown command: \"%c\"\n", command);
        }
    }

    while (true) {
        gpio_put(STATUS_LED, HIGH);
        sleep_ms(1000);
        gpio_put(STATUS_LED, LOW);
        sleep_ms(1000);
    }
}

inspect_system_interactively_databus/inspect_system_interactively_databus.c:

#include <stdio.h>
#include "pico/stdlib.h"

#define D7           7
#define D6           6
#define D5           5
#define D4           4
#define D3           3
#define D2           2
#define D1           1
#define D0           0

#define UART1_TX     8
#define UART1_RX     9
// #define BAUD    345600
#define BAUD    460800

#define READ_BACK_SIGNAL 28
#define READ_BACK_SIGNAL_ACK 27

#define HIGH         1
#define LOW          0

#ifndef PICO_DEFAULT_LED_PIN
#error blink requires a board with a regular LED
#endif

#define STATUS_LED PICO_DEFAULT_LED_PIN
// #define STATUS_LED 15

#define println printf
// #define println(...) ;

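/* static: a plain C11 "inline" definition can fail to link if the compiler decides not to inline it */
static inline void nop(void) {
    __asm__ __volatile__("nop\t\n");
}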

void setup() {
    gpio_init(D0);
    gpio_init(D1);
    gpio_init(D2);
    gpio_init(D3);
    gpio_init(D4);
    gpio_init(D5);
    gpio_init(D6);
    gpio_init(D7);

    gpio_init(STATUS_LED);
    gpio_init(READ_BACK_SIGNAL);
    gpio_init(READ_BACK_SIGNAL_ACK);
    
    gpio_set_dir(D0, GPIO_IN);
    gpio_set_dir(D1, GPIO_IN);
    gpio_set_dir(D2, GPIO_IN);
    gpio_set_dir(D3, GPIO_IN);
    gpio_set_dir(D4, GPIO_IN);
    gpio_set_dir(D5, GPIO_IN);
    gpio_set_dir(D6, GPIO_IN);
    gpio_set_dir(D7, GPIO_IN);
    gpio_set_dir(STATUS_LED, GPIO_OUT);
    gpio_set_dir(READ_BACK_SIGNAL, GPIO_IN);
    gpio_set_dir(READ_BACK_SIGNAL_ACK, GPIO_OUT);

    gpio_put(READ_BACK_SIGNAL_ACK, LOW);

    gpio_set_function(UART1_TX, GPIO_FUNC_UART);
    gpio_set_function(UART1_RX, GPIO_FUNC_UART);
}

#define DATA_BUS_SIZE 8

const unsigned int d_bus[DATA_BUS_SIZE] = {
  D0, D1, D2, D3, D4, D5, D6, D7
};

void set_data_bus_dir(uint32_t dir) {
    gpio_set_dir(D0, dir);
    gpio_set_dir(D1, dir);
    gpio_set_dir(D2, dir);
    gpio_set_dir(D3, dir);
    gpio_set_dir(D4, dir);
    gpio_set_dir(D5, dir);
    gpio_set_dir(D6, dir);
    gpio_set_dir(D7, dir);
}

void set_data_bus(uint8_t data) {
    int i;
    set_data_bus_dir(GPIO_OUT);
    /* Write lowest bit into lowest data line first, then next-lowest bit, etc. */
    for (i = 0; i < DATA_BUS_SIZE; i++) {
        gpio_put(d_bus[i], data & 1);
        data >>= 1;
    }
}

void get_data_bus(uint8_t *data) {
    int i;
    *data = 0;
    /* Read lowest data line into lowest bit first, then next-lowest line, etc. */
    for (i = 0; i < DATA_BUS_SIZE; i++) {
        *data |= (gpio_get(d_bus[i]) << i);
    }
}

int main() {
    uint32_t recv_data = 0;
    uint8_t recv_data8 = 0;
    char c;
    uint8_t buf[3] = { 0 };
    setup();
    stdio_init_all();
    uart_init(uart1, BAUD); // 115200);

    while(true) {
        printf("Repeating loop\n");
        gpio_put(STATUS_LED, HIGH);
        c = uart_getc(uart1);
        gpio_put(STATUS_LED, LOW);
        switch (c) {
            case 'w':
            case 'o':
                printf("Received %c\n", c);
                uart_read_blocking(uart1, buf, sizeof(buf)-1);
                if (sscanf((char*)buf, "%02x", &recv_data) != 1) { // sscanf can also return EOF (-1)
                    printf("Communication error?\n");
                } else {
                    printf("Going to put on data bus: %02x / %03d\n", recv_data, recv_data);
                }
                set_data_bus(recv_data);
                uart_putc_raw(uart1, 'f'); // indicate that we're done setting the bus
                break;
            case 'u': // unset data bus
            case 'S': // unset data bus and then stand by for read
                set_data_bus_dir(GPIO_IN);
                uart_putc_raw(uart1, 'f');
                if (c == 'S') { // stand by for read
                    // UART communication turned out to be a bit slow so we sacrificed one more pin just to get a signal when to read the bus
                    while (!gpio_get(READ_BACK_SIGNAL)) {} // wait for read signal
                    nop();
                    nop();
                    nop();
                    nop();
                    nop();
                    nop();
                    nop();
                    nop();
                    nop();
                    nop();
                    get_data_bus(&recv_data8);
                    gpio_put(READ_BACK_SIGNAL_ACK, HIGH);
                    recv_data = recv_data8;
                    sprintf((char*)buf, "%02x", recv_data);
                    printf("Read from data bus: %s\n", (char*)buf);
                    uart_puts(uart1, (char*)buf);
                    gpio_put(READ_BACK_SIGNAL_ACK, LOW);
                }
                break;
            case 'r':
                printf("Received r\n");
                set_data_bus_dir(GPIO_IN);
                get_data_bus(&recv_data8);
                recv_data = recv_data8;
                sprintf((char*)buf, "%02x", recv_data);
                printf("Read from data bus: %s\n", (char*)buf);
                uart_puts(uart1, (char*)buf);
                break;
            case '0': /* set data bus to 0 */
                set_data_bus(0);
                break;
            default:
                printf("Unexpected command char\n");
        }
    }
}

I’ll consider putting this on my GitHub account at some point. The code is in the public domain.

Testing live (powered) 74xx logic chips in-circuit

Happy New Year! Hopefully with less coronavirus and less violence.

Somebody I know told me about their broken Amiga and how they had pulled out and tested every RAM/ROM/CPU chip and couldn’t find the fault, and now wanted to pull out every 74-series logic chip to test those too.

I didn’t have much to say at the time, but at some point I decided to try my hand at building an in-circuit chip tester, i.e., something that passively monitors a chip’s input and output pins while the device is running, and figures out if the chip is behaving correctly. All that is needed is an Arduino (Nano in my case) and a large IC test clip. The Arduino repeatedly samples all the pins and works out if the chip’s output pins are valid for the given inputs. This works great for simple logic chips (like NOT or AND), but not quite so well for chips with high-impedance modes.
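
As an illustration of what “valid for the given inputs” means, here is a sketch of the check for a 74x00 (four 2-input NAND gates). It is plain C for the host rather than the Arduino code from this post, the function names are made up, and it assumes the sampled pins arrive as a bit mask with chip pin 1 in bit 0.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Bit (pin - 1) of pins holds the sampled level of chip pin number pin. */
static bool pin_level(uint16_t pins, int pin) {
    assert(pin >= 1 && pin <= 16);
    return ((pins >> (pin - 1)) & 1u) != 0;
}

/* 74x00 pinout, one row per gate: { input A, input B, output Y }. */
static const int nand_gates[4][3] = {
    { 1, 2, 3 }, { 4, 5, 6 }, { 9, 10, 8 }, { 12, 13, 11 }
};

/* Returns the number of gates whose output contradicts their inputs. */
int count_7400_errors(uint16_t pins) {
    int errors = 0;
    for (int gate = 0; gate < 4; gate++) {
        bool a = pin_level(pins, nand_gates[gate][0]);
        bool b = pin_level(pins, nand_gates[gate][1]);
        bool y = pin_level(pins, nand_gates[gate][2]);
        if (y != !(a && b)) {
            errors++;
        }
    }
    return errors;
}
```

A sample with all inputs low and all four outputs high reports no errors, while a sample with everything low reports four, because a NAND gate with low inputs must drive its output high.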

The Arduino runs at 16 MHz, and is capable of reading in all pins in about 3 CPU cycles. The device under test would have to be slow enough (which is true for most computers considered “retro” at the time of this writing) to make this work. To avoid sampling while the inputs are changing (which is likely to show as a state that isn’t possible for a correctly functioning chip), the pins are sampled twice, and if samples 1 and 2 aren’t equal, the code samples the pins again, until they’re consistent. (Search for “WAIT_UNTIL_CONSISTENT” in the code for details.)
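
The retry idea can be sketched like this. It is not the WAIT_UNTIL_CONSISTENT code itself, just a host-side illustration in plain C: the hardware read is replaced by a function pointer, replay_sample is a stand-in that replays a fixed list of samples, and the max_attempts limit is an addition so that the example cannot loop forever.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint16_t (*sample_fn)(void *context);

/* Keeps sampling until two consecutive samples agree.
 * Returns 0 and stores the agreed sample in *result, or returns -1 if no
 * two consecutive samples matched within max_attempts extra reads. */
int sample_until_consistent(sample_fn sample, void *context,
                            int max_attempts, uint16_t *result) {
    assert(sample != NULL && result != NULL);
    uint16_t previous = sample(context);
    for (int attempt = 0; attempt < max_attempts; attempt++) {
        uint16_t current = sample(context);
        if (current == previous) {
            *result = current;
            return 0;
        }
        previous = current; /* inputs were still changing, try again */
    }
    return -1;
}

/* Stand-in for the hardware read: replays a fixed list of samples and
 * repeats the last one once the list is used up. */
typedef struct {
    const uint16_t *samples;
    int count;
    int next;
} replay;

uint16_t replay_sample(void *context) {
    replay *r = context;
    uint16_t value = r->samples[r->next];
    if (r->next < r->count - 1) {
        r->next++;
    }
    return value;
}
```

On real hardware the sample function would read the GPIO ports directly, and the logic check would only run on a sample that was seen twice in a row.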

Note: I originally coded this for the Raspberry Pi Pico, but decided to use the Arduino instead because it’s fast enough and is 5V-tolerant.

How to use this

  1. I’d recommend probing using an oscilloscope first
  2. Paste program listing into the Arduino IDE
  3. Connect Arduino via USB to host computer
  4. Press Upload button
  5. Disconnect Arduino from host computer
  6. Wire up Arduino to IC test clip (D2 == pin 1, D3 == pin 2, …, D12 == pin 11, A0 == pin 12, …, A4 == pin 16)
  7. Make sure device to be tested is powered off and attach test clip to chip inside device
  8. Connect Arduino to host computer and open Serial Monitor in the Arduino IDE (or e.g. minicom)
  9. Turn on device to be tested and make sure device powers on normally (or if it never powered on normally in the first place, make sure that it isn’t worse now)
  10. In your serial terminal application, select what kind of chip to test and start monitoring (currently supported chips are: 74x00, 74x02, 74x04, 74x08, 74x32, 74x125, 74x138, 74x157)
  11. If you get errors where you didn’t expect any, check your connections (check if the test clip is seated correctly, especially)
  12. Turn off device under test
  13. Disconnect Arduino from host computer (Warning: It’s possible to power the Arduino through the GPIO pins. Doing this may damage the Arduino, so always make sure that the device under test is powered off before the Arduino is disconnected from USB.)
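The wiring in step 6 determines how the three port reads are combined into one 16-bit word (bit 0 == pin 1, bit 15 == pin 16). Here is a host-side sketch of that packing, assuming the usual Nano mapping of D2..D7 to PD2..PD7, D8..D12 to PB0..PB4 and A0..A4 to PC0..PC4. `pack_pins` is a made-up helper name; the listing does the same thing inline.

```c
#include <stdint.h>

/* Combine raw PIND/PINB/PINC readings into one word, bit (n-1) == clip pin n. */
uint16_t pack_pins(uint8_t pind, uint8_t pinb, uint8_t pinc) {
    uint16_t from_d = (uint16_t)((pind & 0xfc) >> 2);            /* D2..D7  -> bits 0..5   */
    uint16_t from_b = (uint16_t)((uint16_t)(pinb & 0x1f) << 6);  /* D8..D12 -> bits 6..10  */
    uint16_t from_c = (uint16_t)((uint16_t)(pinc & 0x1f) << 11); /* A0..A4  -> bits 11..15 */
    return (uint16_t)(from_d | from_b | from_c);
}
```

The masks throw away D0/D1 (serial), D13 (LED) and A5, which aren't connected to the clip.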

Here are some pics and screenshots:

Checking a working 74LS157 (terminal screenshot)
Checking a 74LS157
Probably also a 74LS157

Note, this is very beta-quality, or even “POC-quality”, software. In particular, I implemented some chip tests without ever testing them (because my device doesn’t have those chips). And the ones I did test, I tested only once or so. In even more particular, I don’t think I ever had a chance to test any chips with high impedance states. So it’s entirely possible that the code related to that is 100% bollocks. Use this code at your own risk and only if you mostly know what you are doing.

Another note: my test clip is a 14-pin clip, which means that pins 8 and 9 are read using GPIO pins that aren’t adjacent to the ones reading pins 7 or 10. Look for “USING_A_14_PIN_TEST_CLIP” to see how this is done.
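To make the renumbering concrete, here is a small sketch that converts a sampled word from "14-pin clip plus two extra wires" order into plain chip pin order. The listing itself doesn't shuffle bits like this (it uses a reordered bit-field struct instead), and `clip_order_to_pin_order` is a hypothetical name.

```c
#include <stdint.h>

/* Input:  bits 0..6 = pins 1..7, bits 7..13 = pins 10..16, bit 14 = pin 8, bit 15 = pin 9.
   Output: bit (n-1) = pin n. */
uint16_t clip_order_to_pin_order(uint16_t raw) {
    uint16_t pins_1_to_7   = (uint16_t)(raw & 0x007f);        /* bits 0..6 stay put       */
    uint16_t pins_10_to_16 = (uint16_t)((raw & 0x3f80) << 2); /* bits 7..13 -> bits 9..15 */
    uint16_t pins_8_and_9  = (uint16_t)((raw & 0xc000) >> 7); /* bits 14..15 -> bits 7..8 */
    return (uint16_t)(pins_1_to_7 | pins_10_to_16 | pins_8_and_9);
}
```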

#include <string.h>
#include <unistd.h>

#define ARRAY_SIZE(array) (sizeof(array)/sizeof(array[0]))

#define ALL_REGULAR_GPIO_PINS 0b00011100011111111111111111111111
#define CHIP_NAME_MAX_LENGTH 6

#define LOGIC_BUFFER_LEN 128
#define PRINTF_BUFFER_LEN 96

#define MAX_BAD_RATE 1 // percent
#define MAX_HIGH_Z_UNLIKELY_RATE 10 // percent

#define REPORT_STATISTICS_EVERY_N_SAMPLES 100000

#define WAIT_UNTIL_CONSISTENT 1

// set below define when testing 16-pin chips using a 14-pin test clip and two extra wires for pin 8 and pin 9, as such:
// gpio 0 will be connected to pin 1
// gpio 1 will be connected to pin 2
// ...
// gpio 6 will be connected to pin 7
// gpio 7 will be connected to pin 10(!)
// gpio 8 will be connected to pin 11
// ...
// gpio 13 will be connected to pin 16
// gpio 14 will be connected through extra wire to pin 8
// gpio 15 will be connected through extra wire to pin 9
#define USING_A_14_PIN_TEST_CLIP 1

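// wrapped in do { } while (0) so that the macro behaves like a single statement; snprintf avoids overflowing printf_buffer
#define printf(...) do { snprintf(printf_buffer, sizeof(printf_buffer), __VA_ARGS__); Serial.println(printf_buffer); Serial.flush(); } while (0)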
#define printf_verbose(...) { if (verbose == true) { printf(__VA_ARGS__); } }

char printf_buffer[PRINTF_BUFFER_LEN];
uint16_t logic_buffer[LOGIC_BUFFER_LEN];
bool verbose = false;

// organically grown enum
enum test_result_enum {
    BAD = 0,
    GOOD = 1,
    HIGH_Z_UNLIKELY = 2,
    HIGH_Z_LIKELY = 3,
    HIGH_Z_INT2 = 4
};
struct test_result {
    enum test_result_enum test_result_enum;
    unsigned int int1;
    unsigned int int2;
};

// some definitions for convenience
const struct test_result TEST_RESULT_GOOD = {
    .test_result_enum = GOOD,
    .int1 = 0,
    .int2 = 0
};
const struct test_result TEST_RESULT_BAD = {
    .test_result_enum = BAD,
    .int1 = 0,
    .int2 = 0
};
const struct test_result TEST_RESULT_HIGH_Z_UNLIKELY = {
    .test_result_enum = HIGH_Z_UNLIKELY,
    .int1 = 0,
    .int2 = 0
};
const struct test_result TEST_RESULT_HIGH_Z_LIKELY = {
    .test_result_enum = HIGH_Z_LIKELY,
    .int1 = 0,
    .int2 = 0
};
const struct test_result TEST_RESULT_HIGH_Z_INT2 = {
    .test_result_enum = HIGH_Z_INT2,
    .int1 = 0,
    .int2 = 0
};

struct overlay_struct_14 {
    bool pin1 : 1;
    bool pin2 : 1;
    bool pin3 : 1;
    bool pin4 : 1;
    bool pin5 : 1;
    bool pin6 : 1;
    bool pin7 : 1;
    bool pin8 : 1;
    bool pin9 : 1;
    bool pin10 : 1;
    bool pin11 : 1;
    bool pin12 : 1;
    bool pin13 : 1;
    bool pin14 : 1;
} __attribute__((packed));

#ifndef USING_A_14_PIN_TEST_CLIP
struct overlay_struct_16 {
    bool pin1 : 1;
    bool pin2 : 1;
    bool pin3 : 1;
    bool pin4 : 1;
    bool pin5 : 1;
    bool pin6 : 1;
    bool pin7 : 1;
    bool pin8 : 1;
    bool pin9 : 1;
    bool pin10 : 1;
    bool pin11 : 1;
    bool pin12 : 1;
    bool pin13 : 1;
    bool pin14 : 1;
    bool pin15 : 1;
    bool pin16 : 1;
} __attribute__((packed));
#else // renumber some pins
struct overlay_struct_16 {
    bool pin1 : 1;
    bool pin2 : 1;
    bool pin3 : 1;
    bool pin4 : 1;
    bool pin5 : 1;
    bool pin6 : 1;
    bool pin7 : 1;
    bool pin10 : 1;
    bool pin11 : 1;
    bool pin12 : 1;
    bool pin13 : 1;
    bool pin14 : 1;
    bool pin15 : 1;
    bool pin16 : 1;
    bool pin8 : 1;
    bool pin9 : 1;
} __attribute__((packed));
#endif

static_assert(sizeof(struct overlay_struct_14) == 2, "overlay_struct_14 has to be exactly 2 bytes, otherwise it isn't a valid overlay struct for 14 pins!");
static_assert(sizeof(struct overlay_struct_16) == 2, "overlay_struct_16 has to be exactly 2 bytes, otherwise it isn't a valid overlay struct for 16 pins!");

union uint_on_overlay_struct {
    uint16_t uint;
    struct overlay_struct_14 overlay_struct_14;
    struct overlay_struct_16 overlay_struct_16;
};

enum chip_type {
    STATELESS_LOGIC = 0,
    STATEFUL_LOGIC = 1
};

const char *chip_names[] = {
    "74x00",
    "74x02",
    "74x04",
    "74x08",
    "74x32",
    "74x125",
    "74x138",
    "74x157"
};
const enum chip_type chip_types[] = {
    STATELESS_LOGIC,
    STATELESS_LOGIC,
    STATELESS_LOGIC,
    STATELESS_LOGIC,
    STATELESS_LOGIC,
    STATELESS_LOGIC,
    STATELESS_LOGIC,
    STATELESS_LOGIC
};

void no_power_wait(union uint_on_overlay_struct original_input) {
    printf("Chip isn't powered, inserting 1000 ms sleep\n");
    printf_verbose("Current state: %04x\n", original_input.uint);
    delay(1000);
}

void error_blink(void) {
    while (true) {
        digitalWrite(13, false);
        delay(50);
        digitalWrite(13, true);
        delay(50);
    }
}

struct test_result validate_74x00(union uint_on_overlay_struct original_input) {
    struct overlay_struct_14 input = original_input.overlay_struct_14;
    if (input.pin14) { // VCC
        if (((input.pin1 & input.pin2) != input.pin3) &&
            ((input.pin4 & input.pin5) != input.pin6) &&
            ((input.pin13 & input.pin12) != input.pin11) &&
            ((input.pin10 & input.pin9) != input.pin8)) {
            return TEST_RESULT_GOOD;
        } else {
            return TEST_RESULT_BAD;
        }
    } else {
        no_power_wait(original_input);
    }
    return TEST_RESULT_GOOD;
}
struct test_result validate_74x02(union uint_on_overlay_struct original_input) {
    struct overlay_struct_14 input = original_input.overlay_struct_14;
    if (input.pin14) { // VCC
        if (((input.pin2 | input.pin3) != input.pin1) &&
            ((input.pin5 | input.pin6) != input.pin4) &&
            ((input.pin12 | input.pin11) != input.pin13) &&
            ((input.pin9 | input.pin8) != input.pin10)) {
            return TEST_RESULT_GOOD;
        } else {
            return TEST_RESULT_BAD;
        }
    } else {
        no_power_wait(original_input);
    }
    return TEST_RESULT_GOOD;
}
struct test_result validate_74x04(union uint_on_overlay_struct original_input) {
    struct overlay_struct_14 input = original_input.overlay_struct_14;
    if (input.pin14) { // VCC
        if ((input.pin1 != input.pin2) &&
            (input.pin3 != input.pin4) &&
            (input.pin5 != input.pin6) &&
            (input.pin13 != input.pin12) &&
            (input.pin11 != input.pin10) &&
            (input.pin9 != input.pin8)) {
            return TEST_RESULT_GOOD;
        } else {
            return TEST_RESULT_BAD;
        }
    } else {
        no_power_wait(original_input);
    }
    return TEST_RESULT_GOOD;
}
struct test_result validate_74x08(union uint_on_overlay_struct original_input) {
    struct overlay_struct_14 input = original_input.overlay_struct_14;
    if (input.pin14) { // VCC
        if (((input.pin1 & input.pin2) == input.pin3) &&
            ((input.pin4 & input.pin5) == input.pin6) &&
            ((input.pin13 & input.pin12) == input.pin11) &&
            ((input.pin10 & input.pin9) == input.pin8)) {
            return TEST_RESULT_GOOD;
        } else {
            return TEST_RESULT_BAD;
        }
    } else {
        no_power_wait(original_input);
    }
    return TEST_RESULT_GOOD;
}
struct test_result validate_74x32(union uint_on_overlay_struct original_input) {
    struct overlay_struct_14 input = original_input.overlay_struct_14;
    if (input.pin14) { // VCC
        if (((input.pin1 | input.pin2) == input.pin3) &&
            ((input.pin4 | input.pin5) == input.pin6) &&
            ((input.pin13 | input.pin12) == input.pin11) &&
            ((input.pin10 | input.pin9) == input.pin8)) {
            return TEST_RESULT_GOOD;
        } else {
            return TEST_RESULT_BAD;
        }
    } else {
        no_power_wait(original_input);
    }
    return TEST_RESULT_GOOD;
}

struct test_result validate_74x125(union uint_on_overlay_struct original_input) {
    struct overlay_struct_14 input = original_input.overlay_struct_14;
    unsigned int unlikely = 0;
    struct test_result res = TEST_RESULT_HIGH_Z_INT2;
    if (input.pin14) { // VCC
        if (!input.pin1) {
            if (input.pin3 != input.pin2) return TEST_RESULT_BAD;
        }
        if (!input.pin4) {
            if (input.pin6 != input.pin5) return TEST_RESULT_BAD;
        }
        if (!input.pin13) {
            if (input.pin12 != input.pin11) return TEST_RESULT_BAD;
        }
        if (!input.pin10) {
            if (input.pin9 != input.pin8) return TEST_RESULT_BAD;
        }

        if (input.pin1) {
            if (input.pin3 != input.pin2) unlikely++;
        }
        if (input.pin4) {
            if (input.pin6 != input.pin5) unlikely++;
        }
        if (input.pin13) {
            if (input.pin12 != input.pin11) unlikely++;
        }
        if (input.pin10) {
            if (input.pin9 != input.pin8) unlikely++;
        }
        res = TEST_RESULT_HIGH_Z_INT2;
        res.int1 = unlikely;
        res.int2 = 4-unlikely; // 4 is the number of outputs on this chip
        return res;
    } else {
        no_power_wait(original_input);
    }
    return TEST_RESULT_GOOD;
}

struct test_result validate_74x138(union uint_on_overlay_struct original_input) {
    struct overlay_struct_16 input = original_input.overlay_struct_16;
    uint8_t select = input.pin1 | (input.pin2 << 1) | (input.pin3 << 2);
    uint8_t output = input.pin15 | (input.pin14 << 1) | (input.pin13 << 2) | (input.pin12 << 3) | (input.pin11 << 4) | (input.pin10 << 5) | (input.pin9 << 6) | (input.pin7 << 7);
    if (input.pin16) { // VCC
        if (input.pin6 & !input.pin4 & !input.pin5) { // chip enabled
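            // outputs are active-low one-hot: select == 0 then 0xfe, select == 1 then 0xfd, select == 2 then 0xfb, ...
            // the cast is required: ~ promotes its operand to int, so without it the comparison with a uint8_t is never true
            if (output == (uint8_t)~(1<<select)) {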
                return TEST_RESULT_GOOD;
            } else {
                return TEST_RESULT_BAD;
            }
        } else { // chip not enabled
            if (output == 0xff) {
                return TEST_RESULT_GOOD;
            } else {
                return TEST_RESULT_BAD;
            }
        }
    } else {
        no_power_wait(original_input);
    }
    return TEST_RESULT_GOOD;
}
struct test_result validate_74x157(union uint_on_overlay_struct original_input) {
    struct overlay_struct_16 input = original_input.overlay_struct_16;
    uint8_t select = input.pin1;

    if (input.pin16) { // VCC
        if (!input.pin15) { // chip enabled
            if (!select) {
                if ((input.pin4 == input.pin2) &&
                    (input.pin7 == input.pin5) &&
                    (input.pin12 == input.pin14) &&
                    (input.pin9 == input.pin11)) {
                    return TEST_RESULT_GOOD;
                } else {
                    return TEST_RESULT_BAD;
                }
            } else {
                if ((input.pin4 == input.pin3) &&
                    (input.pin7 == input.pin6) &&
                    (input.pin12 == input.pin13) &&
                    (input.pin9 == input.pin10)) {
                    return TEST_RESULT_GOOD;
                } else {
                    return TEST_RESULT_BAD;
                }
            }
        } else { // chip not enabled
            // high-impedance
            // we can check for high-impedance heuristically
            // check for activity that would not be possible with an enabled (and working) chip
            // partially mirrors above code
            if (!select) {
                if ((input.pin4 == input.pin2) &&
                    (input.pin7 == input.pin5) &&
                    (input.pin12 == input.pin14) &&
                    (input.pin9 == input.pin11)) {
                    return TEST_RESULT_HIGH_Z_UNLIKELY;
                } else {
                    return TEST_RESULT_HIGH_Z_LIKELY;
                }
            } else {
                if ((input.pin4 == input.pin3) &&
                    (input.pin7 == input.pin6) &&
                    (input.pin12 == input.pin13) &&
                    (input.pin9 == input.pin10)) {
                    return TEST_RESULT_HIGH_Z_UNLIKELY;
                } else {
                    return TEST_RESULT_HIGH_Z_LIKELY;
                }
            }
        }
    } else {
        no_power_wait(original_input);
    }
    return TEST_RESULT_GOOD;
}

struct test_result (*chip_check_funcs[])(union uint_on_overlay_struct) = {
    validate_74x00,
    validate_74x02,
    validate_74x04,
    validate_74x08,
    validate_74x32,
    validate_74x125,
    validate_74x138,
    validate_74x157
};

void setup() {
    unsigned int i = 0, j = 0;
    size_t chars_read = 0;
    char input_buffer[CHIP_NAME_MAX_LENGTH+1] = { 0 };
    bool found = false;
    union uint_on_overlay_struct gpio_input = { 0 };
    register uint8_t gpio_input_b = 0, gpio_input_c = 0, gpio_input_d = 0; // intended effect of 'register': gpio_input_b = PINB is translated into a single instruction (e.g., in r24, 0x03). seems to work as intended.
    register uint8_t gpio_input_b2 = 0, gpio_input_c2 = 0, gpio_input_d2 = 0;
    struct test_result test_result = { 0 };
    unsigned long int bad = 0, good = 0, high_z_likely = 0, high_z_unlikely = 0, high_z_fifty_fifty_matched = 0, high_z_fifty_fifty_unmatched = 0;
    unsigned long int bad_rate = 0, high_z_unlikely_rate = 0, high_z_fifty_fifty_matched_rate = 0;
    unsigned long int n_bool = 0, n_high_z = 0, n_high_z_fifty_fifty = 0, n = 0;

    Serial.begin(115200);

    pinMode(2, INPUT);
    pinMode(3, INPUT);
    pinMode(4, INPUT);
    pinMode(5, INPUT);
    pinMode(6, INPUT);
    pinMode(7, INPUT);
    pinMode(8, INPUT);
    pinMode(9, INPUT);

    pinMode(10, INPUT);
    pinMode(11, INPUT);
    pinMode(12, INPUT);
    pinMode(A0, INPUT);
    pinMode(A1, INPUT);
    pinMode(A2, INPUT);
    pinMode(A3, INPUT);
    pinMode(A4, INPUT);

    pinMode(13, OUTPUT); // onboard LED

    // wait a little bit
    for (i = 0; i < 5; i++) {
        digitalWrite(13, HIGH);
        delay(500);
        digitalWrite(13, LOW);
        delay(500);
    }
    digitalWrite(13, HIGH);
    printf("Passive logic IC tester\n");
    printf("Make sure that voltages on pins to be tested are between 0 and 5V\n");
    printf("Don't forget to connect GND\n\n");

    Serial.flush(); // flush serial output
    while (Serial.available()) {
        Serial.read(); // flush serial input
    }

    printf("Verbose mode? (y/N)\n");
    while (!Serial.available());
    switch (Serial.read()) {
        case 'y':
        case 'Y':
            verbose = true;
            break;
        default:
            verbose = false;
    }

    while (true) {
        printf("Please enter IC to test (e.g., '74x04'). Backspace, cursor keys, etc., aren't supported.\n");
        printf("Supported ICs:\n");
        printf("74x00 (quad 2-input nand)\n");
        printf("74x02 (quad 2-input nor)\n");
        printf("74x04 (hex inverter)\n");
        printf("74x08 (quad 2-input and)\n");
        printf("74x32 (quad 2-input or)\n");
        printf("74x125 (quad bus buffer, negative enable)\n");
        printf("74x138 (3-to-8 decoder, inverting outputs)\n");
        printf("74x157 (quad 2-line to 1-line data selector, non-inverting outputs)\n");

        Serial.flush(); // flush serial output
        while (Serial.available()) {
            Serial.read(); // flush serial input
        }
        for (chars_read = 0; chars_read < CHIP_NAME_MAX_LENGTH; chars_read++) {
            while (!Serial.available());
            input_buffer[chars_read] = Serial.read();
        }
        printf("\n");
        for (i = 0; i < ARRAY_SIZE(chip_names); i++) {
            if (memcmp(input_buffer, chip_names[i], min(chars_read, strlen(chip_names[i]))) == 0) {
                printf("Found at %d\n", i);
                delay(1000);
                printf("Going to test a %s IC, press 'r' to re-select, or any other key to continue\n", chip_names[i]);
                while (!Serial.available());
                if (Serial.read() != 'r') {
                    found = true;
                }
                printf("\n");
                break; // break inner for loop
            }
        }
        if (found) {
            break; // break while loop
        }
        if (i == ARRAY_SIZE(chip_names)) {
            printf("Unknown chip: \"%s\"\n", input_buffer);
        }
    }

    if (chip_types[i] == STATELESS_LOGIC) {
        while (true) {
            while (true) {
                gpio_input_d = PIND;
                gpio_input_b = PINB;
                gpio_input_c = PINC;
#ifdef WAIT_UNTIL_CONSISTENT
                gpio_input_d2 = PIND;
                gpio_input_b2 = PINB;
                gpio_input_c2 = PINC;
                if ((gpio_input_d == gpio_input_d2) &&
                    (gpio_input_b == gpio_input_b2) &&
                    (gpio_input_c == gpio_input_c2)) {
                    break;
                }
#else
                break;
#endif
            }
            gpio_input.uint = ((gpio_input_d & 0xfc) >> 2) | ((uint16_t)(gpio_input_b & 0x1f) << 6) | ((uint16_t)(gpio_input_c & 0x1f) << 11);
            test_result = chip_check_funcs[i](gpio_input);
            switch (test_result.test_result_enum) {
                case BAD:
                    bad++;
                    printf_verbose("Bad state: %04x\n", gpio_input.uint);
                    break;
                case GOOD:
                    good++;
                    break;
                case HIGH_Z_UNLIKELY:
                    high_z_unlikely++;
                    break;
                case HIGH_Z_LIKELY:
                    high_z_likely++;
                    break;
                case HIGH_Z_INT2:
                    high_z_fifty_fifty_matched += test_result.int1;
                    high_z_fifty_fifty_unmatched += test_result.int2;
                    break;
            }
            n_bool = good + bad;
            n_high_z = high_z_likely + high_z_unlikely;
            n_high_z_fifty_fifty = high_z_fifty_fifty_matched + high_z_fifty_fifty_unmatched;
            n = n_bool + n_high_z + n_high_z_fifty_fifty;
            if ((n % REPORT_STATISTICS_EVERY_N_SAMPLES) == 0) {
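                // integer percentages (fixed point, 0.01 -> 1); each divisor can be 0 for chips that never produce that kind of result
                bad_rate = n_bool ? (100*bad)/n_bool : 0;
                high_z_unlikely_rate = n_high_z ? (100*high_z_unlikely)/n_high_z : 0;
                high_z_fifty_fifty_matched_rate = n_high_z_fifty_fifty ? (100*high_z_fifty_fifty_matched)/n_high_z_fifty_fifty : 0;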
                printf("n: %lu\n", n);
                printf("n_bool: %lu bad_rate: %lu%%\n", n_bool, bad_rate);
                printf("n_high_z: %lu high_z_unlikely_rate: %lu%%\n", n_high_z, high_z_unlikely_rate);
                printf("n_high_z_fifty_fifty: %lu high_z_fifty_fifty_matched_rate: %lu%%\n", n_high_z_fifty_fifty, high_z_fifty_fifty_matched_rate);

// below lines are commented out because i didn't need this functionality after all    
//                if (bad_rate > MAX_BAD_RATE) { // some bad results are allowed because we might be sampling right before the chip had a chance to respond to inputs
//                    error_blink();
//                }
//                if (high_z_unlikely_rate > MAX_HIGH_Z_UNLIKELY_RATE) { // some bad results are allowed because we may sometimes sample right before the chip had a chance to respond to inputs
//                    error_blink();
//                }
            }
        }
    } else {
        for (j = 0; j < LOGIC_BUFFER_LEN; j++) {
            gpio_input_d = PIND; // PINx reads the actual pin state; PORTx would only read back the output/pull-up latch
            gpio_input_b = PINB;
            gpio_input_c = PINC;
            logic_buffer[j] = ((gpio_input_d & 0xfc) >> 2) | ((uint32_t)(gpio_input_b & 0x1f) << 6) | ((uint32_t)(gpio_input_c & 0x1f) << 11);
        }
        for (j = 0; j < LOGIC_BUFFER_LEN; j++) {
            gpio_input.uint = logic_buffer[j];
            test_result = chip_check_funcs[i](gpio_input);
            switch (test_result.test_result_enum) {
                case BAD:
                    bad++;
                    printf_verbose("Bad state: %04x\n", logic_buffer[j]&0xffff);
                    break;
                case GOOD:
                    good++;
                    printf_verbose("Good state: %04x\n", logic_buffer[j]&0xffff);
                    break;
                case HIGH_Z_UNLIKELY:
                    high_z_unlikely++;
                    break;
                case HIGH_Z_LIKELY:
                    high_z_likely++;
                    break;
                case HIGH_Z_INT2:
                    high_z_fifty_fifty_matched += test_result.int1;
                    high_z_fifty_fifty_unmatched += test_result.int2;
                    break;
            }            
        }

        n_bool = good + bad;
        n_high_z = high_z_likely + high_z_unlikely;
        n_high_z_fifty_fifty = high_z_fifty_fifty_matched + high_z_fifty_fifty_unmatched;
        n = n_bool + n_high_z + n_high_z_fifty_fifty;

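        // integer percentages (fixed point, 0.01 -> 1); guard against division by zero
        bad_rate = n_bool ? (100*bad)/n_bool : 0;
        high_z_unlikely_rate = n_high_z ? (100*high_z_unlikely)/n_high_z : 0;
        high_z_fifty_fifty_matched_rate = n_high_z_fifty_fifty ? (100*high_z_fifty_fifty_matched)/n_high_z_fifty_fifty : 0;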
        printf("n: %lu\n", n);
        printf("n_bool: %lu bad_rate: %lu%%\n", n_bool, bad_rate);
        printf("n_high_z: %lu high_z_unlikely_rate: %lu%%\n", n_high_z, high_z_unlikely_rate);
        printf("n_high_z_fifty_fifty: %lu high_z_fifty_fifty_matched_rate: %lu%%\n", n_high_z_fifty_fifty, high_z_fifty_fifty_matched_rate);
    }
}

void loop() {
}
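
One C detail worth spelling out for the 74x138 check: the expected output pattern is "all ones except the selected bit", and in C the expression ~(1 << select) is an int, not an 8-bit value, so it has to be truncated before it can be compared with an 8-bit reading. Below is a host-side sketch; `expected_74x138_outputs` is a hypothetical helper that is not part of the listing above, with bit 0 standing for output Y0 and bit 7 for Y7.

```c
#include <stdint.h>
#include <stdbool.h>

/* Expected Y0..Y7 levels (bit n == Yn) of a 74x138 for a given select value.
   Outputs are active low: exactly one bit is 0 when the chip is enabled,
   all bits are 1 when it is not. */
uint8_t expected_74x138_outputs(uint8_t select, bool enabled) {
    if (!enabled) {
        return 0xff;
    }
    return (uint8_t)~(1u << (select & 7)); /* truncate the promoted value to 8 bits */
}
```

For example, select == 3 on an enabled chip gives 0xf7, i.e. only Y3 is low.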

MSX / MSX2 bank switching, and short simple RAM test in BASIC

In this article, we’ll create a short memory test for use on MSX/MSX2 machines to check the lower 32 KB of RAM. Why only the lower 32 KB of RAM? Because you can check the upper 32 KB using pure BASIC PEEKs and POKEs, and generally software won’t run if the upper 32 KB has defects. (Many games may still run even with defects in the lower 32 KB.)

It’s useful to have a test that can be run from BASIC and that can be typed into the machine in a couple minutes.

MSX bank switching

The MSX has a CPU that can only address 65536 addresses, but can have 64 KB of RAM and at least 16 KB of ROM. In a previous article, I mentioned that that is probably handled by copying ROM into RAM, but that is wrong. Instead, there is a chip that has a couple registers and enables/disables ROM/RAM chips based on the value of one of those registers.

On some MSX machines (but not the ones I tried) you may be able to read out that register from BASIC:

print inp(&ha8)

Summary: ROM/RAM can be switched in/out in 16 KB chunks, 0x0000-0x3fff, 0x4000-0x7fff, 0x8000-0xbfff, 0xc000-0xffff. There are four choices possible for each chunk. The register is 8 bits: 2 bits for the first chunk, 2 bits for the second chunk, etc.
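
As a sketch of that bit layout (written in C rather than BASIC or assembly, and assuming that the lowest two bits belong to the first chunk, 0x0000-0x3fff), the register value can be taken apart and put together like this. Both function names are made up for illustration.

```c
#include <stdint.h>

/* Which slot (0..3) is selected for a given 16 KB page (0..3)? */
unsigned slot_for_page(uint8_t a8_value, unsigned page) {
    return (a8_value >> (2 * (page & 3))) & 3;
}

/* Build a register value from one slot number per page. */
uint8_t make_a8_value(unsigned page0, unsigned page1, unsigned page2, unsigned page3) {
    return (uint8_t)((page0 & 3) | ((page1 & 3) << 2) | ((page2 & 3) << 4) | ((page3 & 3) << 6));
}
```

With a value of 0xf0, pages 0 and 1 use slot 0 and pages 2 and 3 use slot 3; selecting slot 3 everywhere gives 0xff.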

There is another register (or rather another register for each slot) though, and it’s often used on MSX2 machines. This register is accessed in memory space, not I/O space. (Memory address 65535, requires that the correct slot be selected in the 0xc000-0xffff chunk.) This register gives us another four choices to select a different ROM/RAM for each choice already made. In other words, we have 4 pages (chunks), 4 slots (ROM/RAM choices), and for each slot, another 4 “subslots” (ROM/RAM choices). If you didn’t 100% understand that, don’t worry, I don’t actually think it’s comprehensible the way I wrote it. But you may still be able to follow the discussion below.

When we start BASIC on a 64 KB machine, we’ll probably have the lower 32 KB mapped to some kind of ROM, and the upper 32 KB mapped to RAM. BASIC then lets us use about 28 KB of that RAM and reserves about 4 KB for its own purposes, or so I assume. (The firmware selects the right slots and subslots to make this work for the machine in question, and the required numbers are different depending on the machine’s configuration. Also remember that there are RAM extension cartridges. During the boot phase, the firmware actively probes the 16 KB chunks to see if something is RAM or not. BTW, if the RAM is sufficiently bad, it won’t detect it as RAM at all and you’ll never see the boot screen.)

BASIC runs from ROM. If we disable that ROM (the “slot” containing the ROM) and instead enable RAM (the “slot” containing the RAM) on that “page” (chunk), we’ll be pulling the carpet from under BASIC’s feet. So if we run a command like this in BASIC:

out &ha8,0

The system will freeze immediately. So instead, we’ll be writing our memory test in assembler and poke it into memory, and then execute it.

We’ll be using a total of 9 instructions in our memory test program. Even if you have never seen Z80 assembly code, the following is almost all you need to know: “di” disables interrupts, just in case. “ei” re-enables interrupts. “ret” returns from our code, i.e., we’ll go back to BASIC. “ld” loads a value into a destination from a source (the destination comes first). “inc” increments a register. Numbers in (parentheses) are like a pointer dereference in C. (Except for in/out, where you always use parentheses.) “cp” compares its argument with register “a”. (Some instructions require the use of register “a”.) “jp” is jump. “jp z,…” is “jump if equal”, with “z” meaning “if the zero flag is set”; “jp nz,…” is the opposite.

Let’s first take a look at a short program that switches slots, undoes the switch, and returns. It looks like this:

org 0c000h

di ; disable interrupts
ld a,0ffh ; put "255" in register "a" (the correct value depends on your machine!)
out (0a8h),a ; enable IO write, and put "0xa8" on the address bus and contents of register "a" on the data bus
;subslot
ld hl,0ffffh ; put "65535" in register "hl"
ld (hl),0ffh ; write "255" into *hl, i.e. into address 65535 (the correct value to write depends on your machine!)
;/subslot


; now undo everything:

ld a,0f0h ; put "240" in register "a" (the correct value depends on your machine!)
out (0a8h),a ; enable IO write, and put "0xa8" on the address bus and contents of register "a" (i.e., 240) on the data bus
;subslot
ld hl,0ffffh ; put "65535" in register "hl"
ld (hl),0f0h ; write "240" into *hl, i.e. into address 65535 (the correct value to write depends on your machine!)
;/subslot
ei ; enable interrupts
ret ; return to caller (BASIC)

This code can be assembled using “z80asm”, which is available in Debian’s repositories at least. z80asm outputs a file called “a.bin”. We can convert that file to unsigned 8-bit integers for use on BASIC data lines using od -t u1 a.bin.

$ z80asm all_ram_and_back.asm 
$ od -t u1 a.bin 
0000000 243  62 255 211 168  33 255 255  54 255  62 240 211 168  33 255
0000020 255  54 240 251 201
0000025
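
If you'd rather not retype the od output by hand, a few lines of C on the host can turn the bytes into a ready-made data line. This is only a sketch: `format_data_line` is a made-up helper, and the trailing -1 is there because the loader further down in this post stops reading when it sees -1.

```c
#include <stdio.h>
#include <stddef.h>
#include <stdint.h>

/* Write e.g. "70 data 243,62,255,-1" into out.
   Returns the number of characters written, or -1 if out is too small. */
int format_data_line(char *out, size_t out_size, unsigned line_no,
                     const uint8_t *bytes, size_t n) {
    int used = snprintf(out, out_size, "%u data ", line_no);
    if (used < 0 || (size_t)used >= out_size) {
        return -1;
    }
    for (size_t i = 0; i < n; i++) {
        int w = snprintf(out + used, out_size - (size_t)used, "%u,", (unsigned)bytes[i]);
        if (w < 0 || (size_t)w >= out_size - (size_t)used) {
            return -1;
        }
        used += w;
    }
    int w = snprintf(out + used, out_size - (size_t)used, "-1"); /* end marker */
    if (w < 0 || (size_t)w >= out_size - (size_t)used) {
        return -1;
    }
    return used + w;
}
```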

Anyway, now we just need to add some code between the two snippets above. We want code that writes to memory addresses and then compares what was written. Here’s the annotated assembly code to do that:

ld hl,00h ; put 0 in register "hl"
write: ; this is a label that we can use in "jp" (jump) instructions
ld (hl),0ffh ; put 255 in *hl, i.e. address 0
inc hl ; increment hl register
ld a,h ; put high byte of "hl" register into "a" register
cp 080h ; check if the "a" register contains 0x80
jp z,done_writing ; if yes, that means we have incremented the hl register a bunch of times and it's time to go check if what we wrote earlier is still there (i.e. we've written 255 into 0x0000 to 0x7fff)
jp write ; if we reach this instruction, that means that the previous instruction didn't perform its conditional jump. this instruction jumps back to the instruction at the "write" label (ld (hl),0ffh). i.e., we haven't reached 0x8000 yet.

done_writing: ; this is a label
ld hl,00h ; put 0 in register "hl"
compare: ; this is a label
ld a,0ffh ; put 255 in register "a"
cp (hl) ; compare contents of register "a" with contents of *hl, i.e. memory address 0
jp nz,bad ; we put 255 in there earlier but this address contains something else now, that means we have bad memory
inc hl ; if we reach this instruction, that means that the previous instruction didn't perform its conditional jump. increment hl register
ld a,h ; put high byte of "hl" register into "a" register
cp 080h ; check if the "a" register contains 0x80 (we only wrote to 0x0000-0x7fff, so that's where we stop comparing)
jp z,done ; if yes, that means we have incremented the hl register a bunch of times and it's time to go home
jp compare ; if we reach this instruction, that means that the previous instruction didn't perform its conditional jump. this instruction jumps back to the instruction at the "compare" label (ld a,0ffh)

bad:
...

done:
...

Finding the correct values for I/O port A8 and memory port 65535

To make the assembled code work on your MSX, you will most likely have to change the values to be written into the A8 I/O register and the values to be written into address 65535 to select the correct sub-slot.

Let’s take a look at the “Slot Map” section on msx.org’s wiki page for the Sony HB-T7, which is the machine I’m trying to debug.

Screenshot from aforementioned page

We want to access RAM, and it’s at slot 3, subslot 3. In BASIC, I get the following output:

?peek(65535)
15

This output is inverted. 15 is 0b00001111 in binary, but we should read this as 0b11110000. The lowest two bits specify the subslot for the 0x0000-0x3fff block. The subslot is set to 0b00, i.e. 0 here. Same for 0x4000-0x7fff. On page 0x8000-0xbfff, we see that subslot 0b11, i.e., 3 is selected. Same for 0xc000-0xffff. This is as expected — the above table from the MSX wiki page specifies that all RAM is at subslot 3 (of slot 3).

If we want the lower pages to point to RAM, we not only have to select slot 3 for those pages in the A8 register (because all the RAM is at slot 3), we also have to select subslot 3.

I.e., to set the subslots for all four 16 KB chunks (“pages”) to use RAM, which is at slot 3 subslot 3, we have to write 0b11111111. To revert this, we have to write 0b11110000 (our peek showed 15 (0b00001111) at memory address 65535, but this is inverted, hence 0b11110000).
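
The inversion is easy to get wrong, so here is the same reasoning as a C sketch (function names are hypothetical): invert the value that PEEK returned to get the actual register contents, then pick out two bits per page, lowest bits first.

```c
#include <stdint.h>

/* Reading address 65535 returns the register contents inverted. */
uint8_t subslot_register_from_peek(uint8_t peeked) {
    return (uint8_t)~peeked;
}

/* Subslot (0..3) selected for a given 16 KB page (0..3). */
unsigned subslot_for_page(uint8_t peeked, unsigned page) {
    return (subslot_register_from_peek(peeked) >> (2 * (page & 3))) & 3;
}
```

With the 15 from the peek above, this gives a register value of 0xf0 (240), subslot 0 for the lower two pages and subslot 3 for the upper two.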

Customizing the memory test

In the above code, we write 0xff to all addresses, and then later check if 0xff is still there. However, one very common type of memory fault is “stuck bits”, where memory bits are always stuck at 1, even if we’d written 0. In order to test that, I recommend that you change “ld (hl),0ffh” and “ld a,0ffh” under the “write” and “compare” labels to different values.
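
To see why the test value matters, here is a tiny simulation in C. Everything in it is made up for illustration: a 16-byte "RAM" in which bit 3 of address 5 is stuck at 1, and a fill-then-compare routine modelled on the assembly above.

```c
#include <stdint.h>
#include <stddef.h>

#define SIM_RAM_SIZE 16

static uint8_t sim_ram[SIM_RAM_SIZE];
static const size_t stuck_addr = 5;     /* hypothetical faulty cell */
static const uint8_t stuck_mask = 0x08; /* bit 3 always reads back as 1 */

static void sim_write(size_t addr, uint8_t value) {
    if (addr == stuck_addr) {
        value |= stuck_mask; /* the stuck bit ignores what we write */
    }
    sim_ram[addr] = value;
}

static uint8_t sim_read(size_t addr) {
    return sim_ram[addr];
}

/* Fill everything with pattern, then compare.
   Returns the first mismatching address, or -1 if everything matched. */
long fill_and_compare(uint8_t pattern) {
    for (size_t a = 0; a < SIM_RAM_SIZE; a++) {
        sim_write(a, pattern);
    }
    for (size_t a = 0; a < SIM_RAM_SIZE; a++) {
        if (sim_read(a) != pattern) {
            return (long)a;
        }
    }
    return -1;
}
```

A run with 0xff finds nothing, because the stuck bit happens to agree with the pattern; a run with 0x00 reports address 5.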

We also have to think about what to do if we have encountered bad memory. One easy thing we can do is generate an audible click. (May be somewhat faint.) Here’s the code to do that:

bad:
ld a,15
out (0abh),a
ld a,14
out (0abh),a
ld a,15

Writing 15 and then 14 to 0xAB produces a click. You can do this in BASIC too:

out &hab,15:out &hab,14

Alternatively, we could write the memory address that contained something unexpected into some memory location, for example like this:

ld de,0c100h ; memory address to write to
ld a,h
ld (de),a
inc de
ld a,l
ld (de),a

This would write the high byte of the failed address to 0xc100 and the low byte to 0xc101.

Putting it all together

Here’s the assembly code for the whole test:

org 0c000h

di
ld a,0ffh
out (0a8h),a
;subslot
ld hl,0ffffh
ld (hl),0ffh
;/subslot
ld hl,00h
write:
ld (hl),0
; ld (hl),l ; alternative code, fills RAM with 0x00-0xff, 0x00-0xff, ...
; nop
inc hl
ld a,h
cp 080h
jp z,done_writing
jp write

done_writing:
ld hl,00h
compare:
ld a,0
; ld a,l ; alternative code, see above
; nop
cp (hl)
jp nz,bad
inc hl
ld a,h
cp 080h
jp z,done
jp compare

bad: ; audible click version
ld a,15
out (0abh),a
ld a,14
out (0abh),a
ld a,15

done:
ld a,0f0h
out (0a8h),a
;subslot
ld hl,0ffffh
ld (hl),0f0h
;/subslot
ei
ret

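And here’s the short BASIC loader. Note that its DATA line encodes the 0xff variant of the test (“ld (hl),0ffh” and “ld a,0ffh”), not the zeros used in the listing above: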

10 i=49152
20 read j
30 if j=-1 then goto 80
40 poke i,j
50 i=i+1
60 goto 20
70 data 243,62,255,211,168,33,255,255,54,255,33,0,0,54,255,35,124,254,128,202,25,192,195,13,192,33,0,0,62,255,190,194,44,192,35,124,254,128,202,54,192,195,28,192,62,15,211,171,62,14,211,171,62,15,62,240,211,168,33,255,255,54,240,251,201,-1
80 def usr1=49152

run
Ok.
x=usr1(0)
Ok.

If you get “Ok.” after “x=usr1(0)”, the machine hasn’t crashed (which means your slot and subslot selections were correct). If you heard a click, your memory is probably defective.

The click may be hard to hear, so maybe try running “x=usr1(0)” in a loop:

for i=0 to 100:x=usr1(0):next

Poor man’s unit tests for C (and maybe C++)

Developing software is mostly about tradeoffs.

Make the software easy to use by getting rid of advanced features? Make the software featureful but harder to use?

Make the software comfortable to develop for but invest a lot of time setting up frameworks and maintaining them? Or make the software a bit less comfortable to work on, but avoid spending a bunch of time learning/maintaining the frameworks?

Well, it’s all up to you, and I have a strong belief that people shouldn’t have strong beliefs about this! Er, okay.

Let’s say you want to add a couple tests (b.c) for individual static functions hidden in a .c file somewhere (a.c). Well, you can’t access those static functions from other .c files. Put the tests in the .c file? Some people would call it ugly. You could also write a script that concatenates the source file and test file and compiles that instead! Bit messy. You can’t have multiple main() functions, etc. But have you ever considered making the “static” keyword disappear using the C preprocessor? It works! And it can be messy too because it’ll make your static variables non-static. Great. But there are cases where that doesn’t matter, especially when we’re just trying to run some unit tests. Here’s a minimal example:

a.c

#include <stdio.h>

static void a()
{
    printf("Hi :,\n");
}

b.c

void a(); // prototype

int main(int argc, char **argv)
{
    a();
}

$ cc -o foo a.c b.c
/usr/bin/ld: /tmp/ccpspoxC.o: in function `main':
b.c:(.text+0x15): undefined reference to `a'
collect2: error: ld returned 1 exit status
$

It doesn’t work, duh. Because a() is static.

And now we’re going to make static disappear and it’ll work, so you can put all your tests in what we called b.c:

$ cc -Dstatic= -o foo a.c b.c
$ # no errors

If you don’t have control over the code base you are working on but still want some quick tests, this hack may be useful.
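
The "it'll make your static variables non-static" caveat is worth seeing once. The sketch below fakes the effect of -Dstatic= inside a single file with a #define, so that both behaviours can sit side by side; the function names are made up for this example.

```c
/* Compiled normally: the counter keeps its value between calls. */
int next_id_normal(void)
{
    static int counter = 0;
    return ++counter;
}

/* From here on, behave as if the file had been compiled with -Dstatic= */
#define static

/* Same function body, but `static` now expands to nothing, so the counter
   is an ordinary local variable that starts at 0 on every call. */
int next_id_stripped(void)
{
    static int counter = 0;
    return ++counter;
}

#undef static
```

Calling next_id_normal twice returns 1 and then 2, while next_id_stripped returns 1 both times. So if the code under test relies on function-local static state, tests built with -Dstatic= will see different behaviour than the real build.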

Raspberry Pi Pico 15.6 KHz analog RGB to VGA upscaler (part 1? POC? WIP?)

My Hitachi MB-H2 MSX machine has an analog RGB port that produces a 15.6 KHz CSYNC (combined horizontal and vertical) signal and analog voltages indicating how red, green or blue things are.

I recently unearthed my old LCD from 2006 or so and decided to see if I could get it to sync if I just massaged the CSYNC signal a bit to bring it to TTL levels and connected a VGA cable.

(Technical details: when you connect a VGA cable to a monitor that is powered on, you will often first of all see a message like “Cable not connected”. To get past that problem, you first have to ground a certain pin on the VGA connector. I found that female Dupont connectors fit reasonably well on male VGA connectors so I just used a cable with female Dupont connectors on both ends to connect the two relevant pins. I’m not sure if it’s the same pin on all monitors. You can find the pin by looking for a pin that should be GND according to the VGA pinout but actually has some voltage on it. Don’t blame me if you break your Dupont connectors by following this advice.)

Unfortunately, that didn’t work. I got “Input not supported”, and I am reasonably sure that is because my monitor doesn’t support 15 KHz signals. Aw, why’d I even bother taking it out of storage?

So what do we do… Well there is this library (PicoVGA) that produces VGA signals using the Raspberry Pi Pico’s PIOs. Raspberry Pi Picos are extremely cheap, just about 600 yen per piece where I am.

Yes, this is animated and super smooth.

Damn, I’ve seen this in videos, but seeing this in real life, a tiny, puny microcontroller generating fricking VGA signals! Amazing. Just last year I was playing around with monochrome composite output on an Arduino Nano, and even that was super impressive to me! (Cue people reading this 20 years in the future and laughing at the silly dude with the retro microcontroller from year-of-the-pandemic 2020. I’m sure microcontrollers in the 2040s will have 32 cores and dozens of pins with built-in 1 GHz DACs and ADCs and mains voltage tolerance, and will be able to generate a couple streams of 4K video ;D)

Some boring technical notes I took before embarking on the project, feel free to skip this section

Is the Pico’s VGA library magic? Yes, definitely. Can we add our own magic to simultaneously capture video and output it via the VGA library?
It sure looks like it! Why?

  • The Pico has two CPU cores, and the VGA library uses just one of them, the second core
    • Dual-core microcontroller, that’s craziness
  • We may be able to use the second core a little bit anyway (“If the second core is not very busy (e.g. when displaying 8-bit graphics that are simply transferred using DMA transfer), it can also be used for the main program work.”)
    • We will indeed be working with 8-bit graphics simply transferred using DMA
  • The Pico has two PIO controllers, and the VGA library uses just one (“The display of the image by the PicoVGA library is performed by the PIO processor controller. PIO0 is used. The other controller, PIO1, is unused and can be used for other purposes.”)

However:

  • We possibly won’t be able to use DMA all that much (“Care must also be taken when using DMA transfer. DMA is used to transfer data to the PIO. Although the transfer uses a FIFO cache, using a different DMA channel may cause the render DMA channel to be delayed and thus cause the video to drop out. A DMA overload can occur, for example, when a large block of data in RAM is transferred quickly. However, the biggest load is the DMA transfer of data from flash memory. In this case, the DMA channel waits for data to be read from flash via QSPI and thus blocks the DMA render channel.”)
    • If we use PIO and DMA for capturing video-in, we might run into trouble there
    • However, using DMA to capture and another DMA transfer to transfer the data to VGA out sounds somewhat inefficient; maybe it’s possible to directly transfer from capture PIO to VGA PIO? Would require modifications to the VGA library, which doesn’t sound so great right now (we didn’t do this)

That said, it’s likely that capturing without the use of PIO would be fast enough, generally speaking.
The “pixel clock” for a 320×200 @ 60 Hz signal is between 4.944 and 6 MHz according to https://tomverbeure.github.io/video_timings_calculator (select 320×200 / 60 in the drop-down menu), depending on some kind of mode that I don’t know anything about.
According to our oscilloscope capture of a single pixel on one of the color channels (DS1Z_QuickPrint22.png), we get about 5.102 MHz. Let’s take that value. We’ll hopefully be able to calculate the exact value at some point. (Yeah, the TMS9918A/TMS9928A/TMS9929A datasheet actually (almost) mentions the exact value! “The VDP is designed to operate with a 10.738635 (± 0.005) MHz crystal”, “This master clock is divided by two to generate the pixel clock (5.3 MHz)”. So it’s 5.3693175 MHz, thank you very much.)
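
For reference, here is the arithmetic as a few lines of C. The crystal frequency is the datasheet value quoted above; the 342 pixel clocks per line and 262 lines per frame are the totals from the same datasheet's timing tables (they show up again in the code comments further down). The function names are made up for this sketch.

```c
/* Figures from the TMS9918A data manual. */
#define CRYSTAL_HZ       10738635.0
#define PIXELS_PER_LINE  342.0
#define LINES_PER_FRAME  262.0

/* The master clock is divided by two to get the pixel clock. */
double pixel_clock_hz(void) { return CRYSTAL_HZ / 2.0; }

/* One scanline is 342 pixel clocks long, including blanking and sync. */
double line_rate_hz(void) { return pixel_clock_hz() / PIXELS_PER_LINE; }

/* One frame is 262 scanlines long. */
double frame_rate_hz(void) { return line_rate_hz() / LINES_PER_FRAME; }

/* Duration of a single scanline in microseconds. */
double line_period_us(void) { return 1e6 / line_rate_hz(); }
```

This works out to a line rate just under 15.7 kHz, a frame rate of a little over 59.9 Hz, and roughly 63.7 us per scanline.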

This means that we have to be able to capture at exactly that frequency. From our previous experimental logic analyzer (which doesn’t use PIO) we were more than capable of capturing everything going on with our Z80 CPU — we had multiple samples of every single state the CPU happened to be in, and the CPU ran at 3.58 MHz. (However, if the VGA library chooses to set the CPU to use a lower clock frequency, we may run into problems. It’s possible to prevent the library from adjusting the clock frequency, but maybe that will impact image quality.) The main part of the code looked like this:

for (i = 0; i < LOGIC_BUFFER_LEN; i++) {
    logic_buffer[i] = gpio_get_all() & ALL_REGULAR_GPIO_PINS;
}

To capture video, we’d like to post-process our capture just a little bit, to convert it to 3-3-2 RGB. Or we could post-process our capture during VSYNC, but that would be a rather tight fit, with only 1.2 ms to work with. (Actually, our signal’s VSYNC pulse is even shorter than that, but there’s nothing on the RGB pins for a while before and after that.)

So our loop might look like this. (Note, the code I ended up writing looks reasonably similar to this, which is why I’m including this here.)

for (x = 0; x < 320; x++) {
    pixel = gpio_get_all();
    red = msb_table_inverted[(pixel & R_MASK) >> R_SHIFT];
    green = msb_table_inverted[(pixel & G_MASK) >> G_SHIFT];
    blue = msb_table_inverted[(pixel & B_MASK) >> B_SHIFT];
    capture[y][x] = red | (green << 3) | (blue << 6);
}

Where msb_table_inverted is a lookup table to convert our raw GPIO input to the proper R/G/B values. This depends on how we do the analog to digital conversion, so the loop might look slightly different in the end.

Well, how likely is it that this will produce a perfectly synced capture? About 0% in my opinion. If we’re too fast, each source pixel gets sampled more than once, so the image will be wider than it should be, and more importantly, cut off on the right side. If we’re too slow, we’ll get a horizontally compressed image.
In the first case, we may be able to improve the situation by adding the right amount of NOPs.
In the second case, we could reduce the amount of on-the-fly post-processing, and do stuff during HBLANK or VBLANK instead.
In addition, we might miss a few pixels on the left side if we can’t begin capturing immediately when we get our HSYNC interrupt. How likely is this to succeed? It might work, I think.
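
To get a feeling for how much a wrong sampling rate hurts, here is a tiny C sketch (the function name is made up). After n samples taken at sample_hz, the source has moved on by n * pixel_hz / sample_hz pixels, and the difference to n is how far off we are at the end of the line.

```c
/* Number of source pixels we are ahead (+) or behind (-) at the end of a
   line when taking n_samples at sample_hz while the source runs at pixel_hz.
   Positive: the capture covers more source pixels than it has samples.
   Negative: the capture covers fewer source pixels than it has samples. */
double drift_pixels(double pixel_hz, double sample_hz, unsigned n_samples)
{
    return n_samples * (pixel_hz / sample_hz - 1.0);
}
```

As an example, sampling 308 pixels at the 5.102 MHz I read off the oscilloscope instead of the datasheet's 5.3693175 MHz would leave us about 16 pixels off by the end of each line.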

The PIOs can also be used without DMA. (Instead of using DMA, we’d use functions like pio_sm_get_blocking().) With PIO, we can get perfect timing, which would be really great to have. We can’t off-load any arithmetic or bit twiddling operations, the PIOs don’t have that. So let’s dig in and run some experiments.

Implementation

The pico_examples repository has a couple of PIO examples. The PicoVGA library has a hello world example. I thought the logic_analyser example in pico_examples looked like a good start. It’s really quite amazing.

  • You can specify the number of samples you’d like to read (const uint CAPTURE_N_SAMPLES = 96)
  • You can specify the number of pins you’d like to sample from (const uint CAPTURE_PIN_COUNT = 2)
  • You can specify the frequency you’d like to read at (logic_analyser_init(pio, sm, CAPTURE_PIN_BASE, CAPTURE_PIN_COUNT, 1.f), where “1.f” is a divider of the system clock. I.e., this will capture at system clock speed. We can specify a float number here.)
  • The PIO input is (mostly?) independent from what else you have going on on that pin, so the code of course proceeds to configure a PWM signal on a pin, and to capture from that same pin. Bonkers!
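
One thing to keep in mind about that float divider: in hardware, the PIO clock divider has 16 integer bits and 8 fractional bits, so not every frequency can be hit exactly. The sketch below shows the effect; it assumes the fraction is simply truncated to 8 bits (the SDK may round differently depending on version), and the struct and function names are made up for this example.

```c
#include <stdint.h>

/* The PIO clock divider is a 16.8 fixed-point number.
   Assumption in this sketch: the fraction is truncated, not rounded. */
typedef struct {
    uint16_t int_part;
    uint8_t frac_part; /* in units of 1/256 */
} pio_divider;

pio_divider quantize_divider(double sys_clk_khz, double target_khz)
{
    double div = sys_clk_khz / target_khz;
    pio_divider d;
    d.int_part = (uint16_t)div;
    d.frac_part = (uint8_t)((div - d.int_part) * 256.0);
    return d;
}

/* Average rate the state machine actually runs at with that divider. */
double achieved_khz(double sys_clk_khz, pio_divider d)
{
    return sys_clk_khz / (d.int_part + d.frac_part / 256.0);
}
```

Assuming a 126 MHz system clock and a target of 5369.3175 kHz, this gives a divider of 23 + 119/256, which corresponds to roughly 5369.74 kHz rather than the exact target.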

Well, let’s cut to the chase, shall we? I took parts of the logic_analyser code to capture the input from RGB, then wrote some code to massage the captured data a little bit, and then output everything using PicoVGA at a higher resolution. After some troubleshooting, I got a readable signal!

However, my capture has wobbly scanlines. Which is why there might be a part 2. And since it’s wobbly, I spent even less effort on the analog to digital conversion than I’d originally planned, which was already rather “poor man” (more on that later, because the code assumes that circuit exists).

I’m triggering the capture by looking for a negative to positive transition, i.e. the end of the sync pulse. (That’s already two out of the three instructions my PIO program consists of, one to wait for low, one to wait for high.) I currently don’t really know why my scanlines are wobbly. I had a few looks with the oscilloscope to see if there’s anything wrong in my circuit that converts CSYNC to TTL levels (for example, slow response from the transistor). But I didn’t find anything so far. :3 It’s of course entirely possible that the source signal is wonky. I’ve never had a chance to connect my MSX to a monitor that supports 15 KHz signals. (Now that’s a major TODO right there.) Of course there are other ways to check if the signal is okay.

We could also (hopefully) get rid of the wobbling by only paying attention to the VSYNC and timing scanlines ourselves, for example by generating them using the Pico’s PWM. As seen in the original logic_analyser.c code! But that’s something for part 2 I guess.

BTW, it’s unlikely that the wobbliness is being caused by a problem with the code or resource contention. I tested this by switching the capture to an off-screen buffer after a few seconds. The screen displayed the last frame captured into the real framebuffer, and was entirely static. I.e., I added code like this into the main loop (which you will see below):

+        if (j > 600) {
+            rgb_buf = fake_rgb_buf;
+            gpio_put(PICO_DEFAULT_LED_PIN, true);
+        } else {
+            j++;
+        }

Poor-man’s ADC

What I actually planned to do: the program I wrote expects four different levels of red, green, and blue. There are three pins per color, and if all pins of a color are 0, that means that color is 0, if only one is 1, that’s still quite dark, if two are 1, that’s somewhat bright, and if all three are 1, then that’s bright. The program then converts that into two bits (0, 1, 2, 3); PicoVGA works with 8-bit colors, 3 bits for red, 3 bits for green, 2 bits for blue. That means that we can capture all the blue we need, and for red and green we could scale the numbers a bit. However, I shelved that plan for now, because I don’t even have enough potentiometers at the moment, and if the signal is as wobbly as it is, that’s just putting lipstick on a pig. Instead, I just took a single color (blue, just because that was less likely to short my MacGyver wiring), and fed that into all colors’ “bright” pin.
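
The three-pins-to-two-bits conversion boils down to "position of the highest pin that is 1, plus one". Here is a small C sketch of that idea, checked against the lookup table the program below uses; level_from_pins and table_matches are names made up for this example.

```c
#include <stdint.h>

/* Level 0-3 from the three pins of one color: the position of the highest
   set bit plus one, or 0 if no pin is high. */
uint8_t level_from_pins(uint8_t pins)
{
    uint8_t level = 0;
    pins &= 0x7;
    while (pins) {
        level++;
        pins >>= 1;
    }
    return level;
}

/* Same values as the msb_table_inverted array in the program. */
const uint8_t msb_table[8] = { 0, 1, 2, 2, 3, 3, 3, 3 };

/* Returns 1 if the loop and the table agree for all eight pin combinations. */
int table_matches(void)
{
    for (unsigned i = 0; i < 8; i++)
        if (level_from_pins((uint8_t)i) != msb_table[i])
            return 0;
    return 1;
}
```

A table lookup is the cheaper option inside the per-pixel loop, which is presumably why it makes sense to use a table rather than a loop like this in the real code.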

As my MSX’s RGB signal voltages are a bit funky (-0.7 to 0.1 IIRC), I converted that to something the Pico can understand using a simple class A-kinda amplifier. The signal gets inverted by this circuit, but that’s fine for a POC. Completely blue will be black, and vice versa.

So here’s the code:

#include "include.h"

#include <stdio.h>
#include <stdlib.h>

#include "pico/stdlib.h"
#include "hardware/pio.h"
#include "hardware/dma.h"
#include "hardware/structs/bus_ctrl.h"

// Some logic to analyse:
#include "hardware/structs/pwm.h"

const uint CAPTURE_PIN_BASE = 9;
const uint CAPTURE_PIN_COUNT = 10; // CSYNC, 3*R, 3*G, 3*B

const float PIXEL_CLOCK = 5369.3175f; // in kHz; datasheet (TMS9918A_TMS9928A_TMS9929A_Video_Display_Processors_Data_Manual_Nov82.pdf) page 3-8 / section 3.6.1 says 5.3693175 MHz (10.738635/2)
// from same page on datasheet
// HORIZONTAL                   PATTERN OR MULTICOLOR   TEXT
// HORIZONTAL ACTIVE DISPLAY    256                     240
// RIGHT BORDER                 15                      25
// RIGHT BLANKING               8                       8
// HORIZONTAL SYNC              26                      26
// LEFT BLANKING                2                       2
// COLOR BURST                  14                      14
// LEFT BLANKING                8                       8
// LEFT BORDER                  13                      19
// TOTAL                        342                     342

const uint INPUT_VIDEO_WIDTH = 308; // left blanking + color burst + left blanking + left border + active + right border

// VERTICAL                     LINE
// VERTICAL ACTIVE DISPLAY      192
// BOTTOM BORDER                24
// BOTTOM BLANKING              3
// VERTICAL SYNC                3
// TOP BLANKING                 13
// TOP BORDER                   27
// TOTAL                        262

const uint INPUT_VIDEO_HEIGHT = 240; // top blanking + top border + active + 1/3 of bottom border
const uint INPUT_VIDEO_HEIGHT_OFFSET_Y = 40; // ignore top 40 (top blanking + top border) scanlines
// we're capturing everything there is to see on the horizontal axis, but throwing out most of the border on the vertical axis
// NOTE: other machines probably have different blanking/border periods

const uint CAPTURE_N_SAMPLES = INPUT_VIDEO_WIDTH;

const uint OUTPUT_VIDEO_WIDTH = 320;
const uint OUTPUT_VIDEO_HEIGHT = 200;

static_assert(OUTPUT_VIDEO_WIDTH >= INPUT_VIDEO_WIDTH);
static_assert(OUTPUT_VIDEO_HEIGHT >= INPUT_VIDEO_HEIGHT-INPUT_VIDEO_HEIGHT_OFFSET_Y);

uint offset; // Lazy global variable; this holds the offset of our PIO program

// Framebuffer
ALIGNED u8 rgb_buf[OUTPUT_VIDEO_WIDTH*OUTPUT_VIDEO_HEIGHT];

static inline uint bits_packed_per_word(uint pin_count) {
    // If the number of pins to be sampled divides the shift register size, we
    // can use the full SR and FIFO width, and push when the input shift count
    // exactly reaches 32. If not, we have to push earlier, so we use the FIFO
    // a little less efficiently.
    const uint SHIFT_REG_WIDTH = 32;
    return SHIFT_REG_WIDTH - (SHIFT_REG_WIDTH % pin_count);
}

void logic_analyser_init(PIO pio, uint sm, uint pin_base, uint pin_count, float div) {
    // Load a program to capture n pins. This waits for CSYNC to go low and
    // then high again, followed by a single `in pins, n` instruction with a wrap.
    uint16_t capture_prog_instr[3];
    capture_prog_instr[0] = pio_encode_wait_gpio(false, pin_base);
    capture_prog_instr[1] = pio_encode_wait_gpio(true, pin_base);
    capture_prog_instr[2] = pio_encode_in(pio_pins, pin_count);
    struct pio_program capture_prog = {
            .instructions = capture_prog_instr,
            .length = 3,
            .origin = -1
    };
    offset = pio_add_program(pio, &capture_prog);

    // Configure state machine to loop over this `in` instruction forever,
    // with autopush enabled.
    pio_sm_config c = pio_get_default_sm_config();
    sm_config_set_in_pins(&c, pin_base);
    sm_config_set_wrap(&c, offset+2, offset+2); // do not repeat pio_encode_wait_gpio instructions
    sm_config_set_clkdiv(&c, div);
    // Note that we may push at a < 32 bit threshold if pin_count does not
    // divide 32. We are using shift-to-right, so the sample data ends up
    // left-justified in the FIFO in this case, with some zeroes at the LSBs.
    sm_config_set_in_shift(&c, true, true, bits_packed_per_word(pin_count)); // push when we have reached 32 - (32 % pin_count) bits (27 if pin_count==9, 30 if pin_count==10)
    sm_config_set_fifo_join(&c, PIO_FIFO_JOIN_RX); // TX not used, so we can use everything for RX
    pio_sm_init(pio, sm, offset, &c);
}

void logic_analyser_arm(PIO pio, uint sm, uint dma_chan, uint32_t *capture_buf, size_t capture_size_words,
                        uint trigger_pin, bool trigger_level) {
    pio_sm_set_enabled(pio, sm, false);
    // Need to clear _input shift counter_, as well as FIFO, because there may be
    // partial ISR contents left over from a previous run. sm_restart does this.
    pio_sm_clear_fifos(pio, sm);
    pio_sm_restart(pio, sm);

    dma_channel_config c = dma_channel_get_default_config(dma_chan);
    channel_config_set_read_increment(&c, false);
    channel_config_set_write_increment(&c, true);
    channel_config_set_dreq(&c, pio_get_dreq(pio, sm, false)); // pio_get_dreq returns something the DMA controller can use to know when to transfer something

    dma_channel_configure(dma_chan, &c,
        capture_buf,        // Destination pointer
        &pio->rxf[sm],      // Source pointer
        capture_size_words, // Number of transfers
        true                // Start immediately
    );

    pio_sm_exec(pio, sm, pio_encode_jmp(offset)); // just restarting doesn't jump back to the initial_pc AFAICT
    pio_sm_set_enabled(pio, sm, true);
}

void blink(uint32_t ms=500)
{
    gpio_put(PICO_DEFAULT_LED_PIN, true);
    sleep_ms(ms);
    gpio_put(PICO_DEFAULT_LED_PIN, false);
    sleep_ms(ms);
}

// uint8_t msb_table_inverted[8] = { 3, 3, 3, 3, 2, 2, 1, 0 };
uint8_t msb_table_inverted[8] = { 0, 1, 2, 2, 3, 3, 3, 3 };

void post_process(uint8_t *rgb_bufy, uint32_t *capture_buf, uint buf_size_words)
{
    uint16_t i, j, k;
    uint32_t temp;
    for (i = 8, j = 0; i < buf_size_words; i++, j += 3) { // start copying at pixel 24 (8*3) (i.e., ignore left blank and color burst, exactly 24 pixels).
        temp = capture_buf[i] >> (2+1); // 2: we're only shifting in 30 bits out of 32, 1: ignore csync
        rgb_bufy[j] = msb_table_inverted[temp & 0b111]; // red
        rgb_bufy[j] |= (msb_table_inverted[(temp & 0b111000) >> 3] << 3); // green
        rgb_bufy[j] |= (msb_table_inverted[(temp & 0b111000000) >> 6] << 6); // blue
        temp >>= 10; // go to next sample, ignoring csync
        rgb_bufy[j+1] = msb_table_inverted[temp & 0b111]; // red
        rgb_bufy[j+1] |= (msb_table_inverted[(temp & 0b111000) >> 3] << 3); // green
        rgb_bufy[j+1] |= (msb_table_inverted[(temp & 0b111000000) >> 6] << 6); // blue
        temp >>= 10; // go to next sample, ignoring csync
        rgb_bufy[j+2] = msb_table_inverted[temp & 0b111]; // red
        rgb_bufy[j+2] |= (msb_table_inverted[(temp & 0b111000) >> 3] << 3); // green
        rgb_bufy[j+2] |= (msb_table_inverted[(temp & 0b111000000) >> 6] << 6); // blue
    }
}

int main()
{
    uint16_t i, y;

    gpio_init(PICO_DEFAULT_LED_PIN);
    gpio_init(CAPTURE_PIN_BASE);
    gpio_set_dir(PICO_DEFAULT_LED_PIN, GPIO_OUT);
    gpio_set_dir(CAPTURE_PIN_BASE, GPIO_IN);

    blink();

    // initialize videomode
    Video(DEV_VGA, RES_CGA, FORM_8BIT, rgb_buf);

    blink();

    // We're going to capture into a u32 buffer, for best DMA efficiency. Need
    // to be careful of rounding in case the number of pins being sampled
    // isn't a power of 2.
    uint total_sample_bits = CAPTURE_N_SAMPLES * CAPTURE_PIN_COUNT;
    total_sample_bits += bits_packed_per_word(CAPTURE_PIN_COUNT) - 1;
    uint buf_size_words = total_sample_bits / bits_packed_per_word(CAPTURE_PIN_COUNT);
    uint32_t *capture_buf0 = (uint32_t*)malloc(buf_size_words * sizeof(uint32_t));
    hard_assert(capture_buf0);
    uint32_t *capture_buf1 = (uint32_t*)malloc(buf_size_words * sizeof(uint32_t));
    hard_assert(capture_buf1);

    blink();

    // Grant high bus priority to the DMA, so it can shove the processors out
    // of the way. This should only be needed if you are pushing things up to
    // >16bits/clk here, i.e. if you need to saturate the bus completely.
    // (Didn't try this)
//     bus_ctrl_hw->priority = BUSCTRL_BUS_PRIORITY_DMA_W_BITS | BUSCTRL_BUS_PRIORITY_DMA_R_BITS;

    PIO pio = pio1;
    uint sm = 0;
    uint dma_chan = 8; // 0-7 may be used by VGA library (depending on resolution)

    logic_analyser_init(pio, sm, CAPTURE_PIN_BASE, CAPTURE_PIN_COUNT, (float)Vmode.freq/PIXEL_CLOCK);

    blink();

    // 1) DMA in 1st scan line, wait for completion
    // 2) DMA in 2nd scan line, post-process previous scan line, wait for completion
    // 3) DMA in 3rd scan line, post-process previous scan line, wait for completion
    // ...
    // n) Post-process last scanline

    // I'm reasonably sure we have enough processing power to post-process scanlines in real time; one scanline lasts about 63.7 us (342 pixel clocks at 5.3693175 MHz).
    // At 126 MHz each clock cycle is about 8 ns, so we have about 8000 cycles to process about 320 bytes, or about 25 cycles per byte.
    while (true) {
        // "Software-render" vsync detection... I.e., wait for low on csync, usleep for hsync_pulse_time+something, check if we're still low
        // If we are, that's a vsync pulse!
        // This works well enough AFAICT
        while (true) {
            while(gpio_get(CAPTURE_PIN_BASE)); // wait for negative pulse on csync
            sleep_us(10); // hsync negative pulse is about 4.92 us according to oscilloscope, so let's wait a little longer than 4.92 us
            if (!gpio_get(CAPTURE_PIN_BASE)) // we're still low! this must be a vsync pulse
                break;
        }
        for (y = 0; y <= INPUT_VIDEO_HEIGHT_OFFSET_Y; y ++) { // capture and throw away first 40 scanlines, capture without throwing away 41st scanline
            logic_analyser_arm(pio, sm, dma_chan, capture_buf0, buf_size_words, CAPTURE_PIN_BASE, true);
            dma_channel_wait_for_finish_blocking(dma_chan);
        }
        for (y = 1; y < (INPUT_VIDEO_HEIGHT-INPUT_VIDEO_HEIGHT_OFFSET_Y)-1; y += 2) {
            logic_analyser_arm(pio, sm, dma_chan, capture_buf1, buf_size_words, CAPTURE_PIN_BASE, true);
            post_process(rgb_buf + (y-1)*OUTPUT_VIDEO_WIDTH, capture_buf0, buf_size_words);
            dma_channel_wait_for_finish_blocking(dma_chan);

            logic_analyser_arm(pio, sm, dma_chan, capture_buf0, buf_size_words, CAPTURE_PIN_BASE, true);
            post_process(rgb_buf + y*OUTPUT_VIDEO_WIDTH, capture_buf1, buf_size_words);
            dma_channel_wait_for_finish_blocking(dma_chan);
        }
        post_process(rgb_buf + (y-1)*OUTPUT_VIDEO_WIDTH, capture_buf0, buf_size_words); // the scanline still sitting in capture_buf0 belongs to row y-1
    }
}

Replace vga_hello/src/main.cpp with the above file and recompile (make program.uf2). Maybe this post will help if you are on something that isn’t Windows and can’t get this to compile.
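
The shifts in post_process (the ">> (2+1)" and ">>= 10") depend on how the PIO packs three 10-bit samples into one 32-bit FIFO word. Here is a plain C model of that packing, as I understand it from the comments in the original logic_analyser example; it runs on any machine, and all names in it are made up for this sketch.

```c
#include <stdint.h>

#define PIN_COUNT 10u                                /* CSYNC + 3*R + 3*G + 3*B */
#define SAMPLES_PER_WORD (32u / PIN_COUNT)           /* 3 */
#define BITS_PER_WORD (SAMPLES_PER_WORD * PIN_COUNT) /* 30 */

/* Model of the input shift register shifting right: each new sample enters
   at the top and pushes older samples towards bit 0. */
uint32_t isr_shift_in(uint32_t isr, uint32_t sample)
{
    return (isr >> PIN_COUNT) | (sample << (32u - PIN_COUNT));
}

/* Pack three samples the way one FIFO word would arrive. */
uint32_t pack_word(uint32_t s0, uint32_t s1, uint32_t s2)
{
    uint32_t isr = 0;
    isr = isr_shift_in(isr, s0);
    isr = isr_shift_in(isr, s1);
    isr = isr_shift_in(isr, s2);
    return isr;
}

/* Recover sample n (0 = oldest) from a packed word. */
uint32_t unpack_sample(uint32_t word, unsigned n)
{
    unsigned unused_low_bits = 32u - BITS_PER_WORD; /* 2 */
    return (word >> (unused_low_bits + n * PIN_COUNT)) & ((1u << PIN_COUNT) - 1u);
}

/* Number of 32-bit words needed for n_samples, rounded up. */
unsigned words_needed(unsigned n_samples)
{
    return (n_samples * PIN_COUNT + BITS_PER_WORD - 1u) / BITS_PER_WORD;
}
```

So the oldest sample sits at bits 2 to 11 with CSYNC as its lowest bit, which is why post_process shifts by 2+1 first and then by 10 for each following sample. For 308 samples, words_needed comes out as 103.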

Explanation

The PIO program is generated in the logic_analyser_init function. Here it is again:

    capture_prog_instr[0] = pio_encode_wait_gpio(false, pin_base);
    capture_prog_instr[1] = pio_encode_wait_gpio(true, pin_base);
    capture_prog_instr[2] = pio_encode_in(pio_pins, pin_count);
    struct pio_program capture_prog = {
            .instructions = capture_prog_instr,
            .length = 3,
            .origin = -1
    };

First we wait for a “false” (low) signal. Then a “true” (high) signal. Then we read. Okay… but that doesn’t make any sense, does it?
No, it doesn’t, but maybe it will with the following bit of code:

    sm_config_set_wrap(&c, offset+2, offset+2); // do not repeat pio_encode_wait_gpio instructions

sm_config_set_wrap is used to tell the PIOs how to loop the PIO program. And in this case, we loop after we have executed the instruction at offset+2, and we jump to offset+2. The instruction at offset+2 is the “in” instruction. That is, we just keep executing the “in” instruction, except the first time. The first time, we wait for low on CSYNC, then wait for high on CSYNC, and then (as this state means that the CSYNC pulse is over) we keep reading as fast as we can (at the programmed PIO speed).
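
If the wrap business is still a bit abstract, here is a toy model in C of how the program counter moves. It ignores the fact that the two wait instructions stall until their condition is met, and the function names are made up for this sketch.

```c
/* Next program counter for a state machine when no jump is taken: after
   executing the instruction at wrap_top, execution continues at
   wrap_bottom; otherwise it continues at the next address. */
unsigned next_pc(unsigned pc, unsigned wrap_bottom, unsigned wrap_top)
{
    return (pc == wrap_top) ? wrap_bottom : pc + 1u;
}

/* Record the first `steps` addresses executed, starting at `start`. */
void trace_pcs(unsigned start, unsigned wrap_bottom, unsigned wrap_top,
               unsigned *out, unsigned steps)
{
    unsigned pc = start;
    for (unsigned i = 0; i < steps; i++) {
        out[i] = pc;
        pc = next_pc(pc, wrap_bottom, wrap_top);
    }
}
```

Starting at 0 with both wrap addresses set to 2 (relative to offset), the trace is 0, 1, 2, 2, 2, and so on: the two waits run once, and then only the “in” instruction repeats.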

Results

Let’s take a look at the results. Remember, we’re converting to monochrome, and only looking at the blue channel. Remember that our super lazy “analog frontend” is super lazy, and the potentiometer has to be fine-tuned to get to a sweet spot that allows everything on the screen to be displayed.

The composite signal. Black looking very… let’s call it RGB, is one of the things that motivated me to check if I can get monitor output to work. The other thing is the jailbars. The jailbars are more prominent when showing a dark color.
This is before tuning the capture parameters to ignore HBLANK and VBLANK, so we’re slightly cut off at the bottom and on the right. We’re only feeding into the pin for green here. Everything where blue is at zero intensity is green (top VBLANK and left HBLANK and black characters), and everything where blue is at full intensity, is black. I was running off a slightly wrong pixel clock here. You can see that the boundary between HBLANK green and black is fuzzy. On some scanlines we start a pixel (or fraction of a pixel) early, on others a pixel (or fraction thereof) late. On the next frame, this moves a little. It’s like there’s a somewhat low-frequency wave overlaid over the sync signal. Maybe just our old friend, interference? My CSYNC wire _is_ rather janky. Let’s just say, nothing’s shielded, I’m using a paper clip to get the signal out of the RGB jack, I’m connecting multiple jumper wires to get to the right length, and the ground wire is crazy long.
And this is what it looks like with the HBLANK and VBLANK front porches ignored, and the pixel clock corrected. (Wait, I still see the horizontal front porch? Must be some qaulity code there.) TBH I have a feeling that the wobbliness increased with the correct pixel clock ;D Um, I’ll get to the bottom of this at some point. (It also looks like we’re ignoring too many scanlines at the top, but that’s okay for now.) Note: the noise you see on the screen isn’t part of the signal, that’s just my camera. This also shows that “m”s don’t look too good. (In my defense, they don’t look too clever on composite either.)
Actually the HBLANK front porches are gone now after I fixed a typo in the code. But it’s still quite wobbly. Maybe not quite as wobbly as in the above video?
Top breadboard converts CSYNC signal to TTL (and there’s some other stuff on there that isn’t used right now). Bottom double breadboard would be large enough for everything, but this sort of grew organically. The “USB POWER” thing is this: https://www.amazon.co.jp/dp/B07XM5FWDW. Super useful tiny power supply that runs off USB! I think I got it cheaper than the current price though. Not shown on this pic, but I run this setup off a small USB power bank, and use the power supply to convert the 5V from USB to 3.3V.
What’s the pen and the eraser doing here? TBH my eyes just tend to filter out junk after a while. So stuff just sort of becomes part of the scenery.

Minor update

Fixing a typo in the code (already fixed above as it made no sense to leave it there) fixed up the signal quite a bit. I also added buttons to fine-tune the pixel clock. This stabilizes the signal significantly. However, hopefully mostly due to the fact that our analog frontend is a bit lame, we get a somewhat fuzzy image, where some pixels change between black and white. I am somewhat tempted to build out the analog frontend properly but before that I think I’ll try my hand at digital RGB, more on that in a later post.

Anyway, here’s the updated code for analog input, with support for two buttons to fine-tune the pixel clock:

#include "include.h"

#include <stdio.h>
#include <stdlib.h>

#include "pico/stdlib.h"
#include "hardware/pio.h"
#include "hardware/dma.h"
#include "hardware/structs/bus_ctrl.h"

const uint CAPTURE_PIN_BASE = 9;
const uint CAPTURE_PIN_COUNT = 10; // CSYNC, 3*R, 3*G, 3*B
const uint INCREASE_BUTTON_PIN = 20;
const uint DECREASE_BUTTON_PIN = 21;

const PIO pio = pio1;
const uint sm = 0;
const uint dma_chan = 8; // 0-7 may be used by VGA library (depending on resolution)

const float PIXEL_CLOCK = 5369.3175f; // in kHz; datasheet (TMS9918A_TMS9928A_TMS9929A_Video_Display_Processors_Data_Manual_Nov82.pdf) page 3-8 / section 3.6.1 says 5.3693175 MHz (10.738635/2)
// the pixel clock has a tolerance of +-0.005 (i.e. +- 5 KHz), let's add a facility to adjust our hard-coded pixel clock:
const float PIXEL_CLOCK_ADJUSTER = 0.1; // KHz

// from same page on datasheet
// HORIZONTAL                   PATTERN OR MULTICOLOR   TEXT
// HORIZONTAL ACTIVE DISPLAY    256                     240
// RIGHT BORDER                 15                      25
// RIGHT BLANKING               8                       8
// HORIZONTAL SYNC              26                      26
// LEFT BLANKING                2                       2
// COLOR BURST                  14                      14
// LEFT BLANKING                8                       8
// LEFT BORDER                  13                      19
// TOTAL                        342                     342

const uint INPUT_VIDEO_WIDTH = 308; // left blanking + color burst + left blanking + left border + active + right border

// VERTICAL                     LINE
// VERTICAL ACTIVE DISPLAY      192
// BOTTOM BORDER                24
// BOTTOM BLANKING              3
// VERTICAL SYNC                3
// TOP BLANKING                 13
// TOP BORDER                   27
// TOTAL                        262

const uint INPUT_VIDEO_HEIGHT = 240; // top blanking + top border + active + 1/3 of bottom border
const uint INPUT_VIDEO_HEIGHT_OFFSET_Y = 40; // ignore top 40 (top blanking + top border) scanlines
// we're capturing everything there is to see on the horizontal axis, but throwing out most of the border on the vertical axis
// NOTE: other machines probably have different blanking/border periods

const uint CAPTURE_N_SAMPLES = INPUT_VIDEO_WIDTH;

const uint OUTPUT_VIDEO_WIDTH = 320;
const uint OUTPUT_VIDEO_HEIGHT = 200;

static_assert(OUTPUT_VIDEO_WIDTH >= INPUT_VIDEO_WIDTH);
static_assert(OUTPUT_VIDEO_HEIGHT >= INPUT_VIDEO_HEIGHT-INPUT_VIDEO_HEIGHT_OFFSET_Y);

uint offset; // Lazy global variable; this holds the offset of our PIO program

// Output frame buffer (8 bits per pixel), handed to the VGA library in main()
ALIGNED u8 rgb_buf[OUTPUT_VIDEO_WIDTH*OUTPUT_VIDEO_HEIGHT];

static inline uint bits_packed_per_word(uint pin_count) {
    // If the number of pins to be sampled divides the shift register size, we
    // can use the full SR and FIFO width, and push when the input shift count
    // exactly reaches 32. If not, we have to push earlier, so we use the FIFO
    // a little less efficiently.
    const uint SHIFT_REG_WIDTH = 32;
    return SHIFT_REG_WIDTH - (SHIFT_REG_WIDTH % pin_count);
}

void logic_analyser_init(PIO pio, uint sm, uint pin_base, uint pin_count, float div) {
    // Load a program to capture n pins. It first waits for pin_base (CSYNC) to
    // go low and then high again, and after that it is just a single
    // `in pins, n` instruction with a wrap.
    static bool already_initialized_once = false;
    uint16_t capture_prog_instr[3];
    capture_prog_instr[0] = pio_encode_wait_gpio(false, pin_base);
    capture_prog_instr[1] = pio_encode_wait_gpio(true, pin_base);
    capture_prog_instr[2] = pio_encode_in(pio_pins, pin_count);
    struct pio_program capture_prog = {
            .instructions = capture_prog_instr,
            .length = 3,
            .origin = -1
    };
    if (already_initialized_once) {
        pio_remove_program(pio, &capture_prog, offset);
    }
    offset = pio_add_program(pio, &capture_prog);
    already_initialized_once = true;

    // Configure state machine to loop over this `in` instruction forever,
    // with autopush enabled.
    pio_sm_config c = pio_get_default_sm_config();
    sm_config_set_in_pins(&c, pin_base);
    sm_config_set_wrap(&c, offset+2, offset+2); // do not repeat pio_encode_wait_gpio instructions
    sm_config_set_clkdiv(&c, div);
    // Note that we may push at a < 32 bit threshold if pin_count does not
    // divide 32. We are using shift-to-right, so the sample data ends up
    // left-justified in the FIFO in this case, with some zeroes at the LSBs.
    sm_config_set_in_shift(&c, true, true, bits_packed_per_word(pin_count)); // push when we have reached 32 - (32 % pin_count) bits (27 if pin_count==9, 30 if pin_count==10)
    sm_config_set_fifo_join(&c, PIO_FIFO_JOIN_RX); // TX not used, so we can use everything for RX
    pio_sm_init(pio, sm, offset, &c);
}

void logic_analyser_arm(PIO pio, uint sm, uint dma_chan, uint32_t *capture_buf, size_t capture_size_words,
                        uint trigger_pin, bool trigger_level) {
    // TODO: disable interrupts
    pio_sm_set_enabled(pio, sm, false);
    // Need to clear _input shift counter_, as well as FIFO, because there may be
    // partial ISR contents left over from a previous run. sm_restart does this.
    pio_sm_clear_fifos(pio, sm);
    pio_sm_restart(pio, sm);

    dma_channel_config c = dma_channel_get_default_config(dma_chan);
    channel_config_set_read_increment(&c, false);
    channel_config_set_write_increment(&c, true);
    channel_config_set_dreq(&c, pio_get_dreq(pio, sm, false)); // pio_get_dreq returns something the DMA controller can use to know when to transfer something

    dma_channel_configure(dma_chan, &c,
        capture_buf,        // Destination pointer
        &pio->rxf[sm],      // Source pointer
        capture_size_words, // Number of transfers
        true                // Start immediately
    );

    pio_sm_exec(pio, sm, pio_encode_jmp(offset)); // just restarting doesn't jump back to the initial_pc AFAICT
    pio_sm_set_enabled(pio, sm, true);
}

void blink(uint32_t ms=500)
{
    gpio_put(PICO_DEFAULT_LED_PIN, true);
    sleep_ms(ms);
    gpio_put(PICO_DEFAULT_LED_PIN, false);
    sleep_ms(ms);
}

// uint8_t msb_table_inverted[8] = { 3, 3, 3, 3, 2, 2, 1, 0 };
uint8_t msb_table_inverted[8] = { 0, 1, 2, 2, 3, 3, 3, 3 };

void post_process(uint8_t *rgb_bufy, uint32_t *capture_buf, uint buf_size_words)
{
    uint16_t i, j, k;
    uint32_t temp;
    for (i = 8, j = 0; i < buf_size_words; i++, j += 3) { // start copying at pixel 24 (8*3) (i.e., ignore left blank and color burst, exactly 24 pixels).
        temp = capture_buf[i] >> (2+1); // 2: we're only shifting in 30 bits out of 32, 1: ignore csync
        rgb_bufy[j] = msb_table_inverted[temp & 0b111]; // red
        rgb_bufy[j] |= (msb_table_inverted[(temp & 0b111000) >> 3] << 3); // green
        rgb_bufy[j] |= (msb_table_inverted[(temp & 0b111000000) >> 6] << 6); // blue
        temp >>= 10; // go to next sample, ignoring csync
        rgb_bufy[j+1] = msb_table_inverted[temp & 0b111]; // red
        rgb_bufy[j+1] |= (msb_table_inverted[(temp & 0b111000) >> 3] << 3); // green
        rgb_bufy[j+1] |= (msb_table_inverted[(temp & 0b111000000) >> 6] << 6); // blue
        temp >>= 10; // go to next sample, ignoring csync
        rgb_bufy[j+2] = msb_table_inverted[temp & 0b111]; // red
        rgb_bufy[j+2] |= (msb_table_inverted[(temp & 0b111000) >> 3] << 3); // green
        rgb_bufy[j+2] |= (msb_table_inverted[(temp & 0b111000000) >> 6] << 6); // blue
    }
}

void adjust_pixel_clock(float adjustment) {
    static absolute_time_t last_adjustment = { 0 };
    static float pixel_clock_adjustment = 0.0f;
    absolute_time_t toc = get_absolute_time();
    if (absolute_time_diff_us(last_adjustment, toc) > 250000) {
        pio_sm_set_enabled(pio, sm, false);
        pixel_clock_adjustment += adjustment;
        last_adjustment = toc;
        logic_analyser_init(pio, sm, CAPTURE_PIN_BASE, CAPTURE_PIN_COUNT, ((float)Vmode.freq)/(PIXEL_CLOCK+pixel_clock_adjustment));
    }
}

int main()
{
    uint16_t i, y;

    gpio_init(PICO_DEFAULT_LED_PIN);
    gpio_init(CAPTURE_PIN_BASE);
    gpio_set_dir(PICO_DEFAULT_LED_PIN, GPIO_OUT);
    gpio_set_dir(CAPTURE_PIN_BASE, GPIO_IN);

    blink();

    // initialize videomode
    Video(DEV_VGA, RES_CGA, FORM_8BIT, rgb_buf);

    blink();

    // We're going to capture into a u32 buffer, for best DMA efficiency. Need
    // to be careful of rounding in case the number of pins being sampled
    // isn't a power of 2.
    uint total_sample_bits = CAPTURE_N_SAMPLES * CAPTURE_PIN_COUNT;
    total_sample_bits += bits_packed_per_word(CAPTURE_PIN_COUNT) - 1;
    uint buf_size_words = total_sample_bits / bits_packed_per_word(CAPTURE_PIN_COUNT);
    uint32_t *capture_buf0 = (uint32_t*)malloc(buf_size_words * sizeof(uint32_t));
    hard_assert(capture_buf0);
    uint32_t *capture_buf1 = (uint32_t*)malloc(buf_size_words * sizeof(uint32_t));
    hard_assert(capture_buf1);

    blink();

    // Grant high bus priority to the DMA, so it can shove the processors out
    // of the way. This should only be needed if you are pushing things up to
    // >16bits/clk here, i.e. if you need to saturate the bus completely.
    // (Didn't try this)
//     bus_ctrl_hw->priority = BUSCTRL_BUS_PRIORITY_DMA_W_BITS | BUSCTRL_BUS_PRIORITY_DMA_R_BITS;

    logic_analyser_init(pio, sm, CAPTURE_PIN_BASE, CAPTURE_PIN_COUNT, (float)Vmode.freq/PIXEL_CLOCK);

    blink();

    // 1) DMA in 1st scan line, wait for completion
    // 2) DMA in 2nd scan line, post-process previous scan line, wait for completion
    // 3) DMA in 3rd scan line, post-process previous scan line, wait for completion
    // ...
    // n) Post-process last scanline

    // I'm reasonably sure we have enough processing power to post-process scanlines in real time, we should have about 64 us per scanline (342 pixel clocks at 5.3693175 MHz).
    // At 126 MHz each clock cycle is about 8 ns, so we have roughly 8000 cycles to process about 320 bytes, or about 25 cycles per byte.
    while (true) {
        // "Software-render" vsync detection... I.e., wait for low on csync, usleep for hsync_pulse_time+something, check if we're still low
        // If we are, that's a vsync pulse!
        // This works well enough AFAICT
        while (true) {
            while(gpio_get(CAPTURE_PIN_BASE)); // wait for negative pulse on csync
            sleep_us(10); // hsync negative pulse is about 4.92 us according to oscilloscope, so let's wait a little longer than 4.92 us
            if (!gpio_get(CAPTURE_PIN_BASE)) // we're still low! this must be a vsync pulse
                break;
        }
        for (y = 0; y <= INPUT_VIDEO_HEIGHT_OFFSET_Y; y ++) { // capture and throw away first 40 scanlines, capture without throwing away 41st scanline
            logic_analyser_arm(pio, sm, dma_chan, capture_buf0, buf_size_words, CAPTURE_PIN_BASE, true);
            dma_channel_wait_for_finish_blocking(dma_chan);
        }
        for (y = 1; y < (INPUT_VIDEO_HEIGHT-INPUT_VIDEO_HEIGHT_OFFSET_Y)-1; y += 2) {
            logic_analyser_arm(pio, sm, dma_chan, capture_buf1, buf_size_words, CAPTURE_PIN_BASE, true);
            post_process(rgb_buf + (y-1)*OUTPUT_VIDEO_WIDTH, capture_buf0, buf_size_words);
            dma_channel_wait_for_finish_blocking(dma_chan);

            logic_analyser_arm(pio, sm, dma_chan, capture_buf0, buf_size_words, CAPTURE_PIN_BASE, true);
            post_process(rgb_buf + y*OUTPUT_VIDEO_WIDTH, capture_buf1, buf_size_words);
            dma_channel_wait_for_finish_blocking(dma_chan);
        }
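        // The loop exits with y one past the last row written, and capture_buf0 holds the scanline after that row.
        // So this last line goes to row y-1 (using y-2 would overwrite the row we just wrote).
        post_process(rgb_buf + (y-1)*OUTPUT_VIDEO_WIDTH, capture_buf0, buf_size_words);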

        if (gpio_get(INCREASE_BUTTON_PIN)) {
            adjust_pixel_clock(PIXEL_CLOCK_ADJUSTER); // + some Hz
        } else if (gpio_get(DECREASE_BUTTON_PIN)) {
            adjust_pixel_clock(-PIXEL_CLOCK_ADJUSTER); // - some Hz
        }
    }
}
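
To make the packing arithmetic in the listing above a bit more concrete, here is a small host-side sketch (plain C++, no Pico SDK needed). It redoes the buffer sizing for one scanline and unpacks a single FIFO word the same way one iteration of post_process does, just written as a loop. words_per_line, unpack_word and LEVEL_TABLE are hypothetical names used only here. The lookup table holds the same values as msb_table_inverted; it appears to map each 3-bit reading to a 2-bit level based on its highest set bit, but treat that reading as an assumption.

```cpp
#include <array>
#include <cstdint>

// Same rule as bits_packed_per_word() in the listing: the largest multiple
// of pin_count that fits into the 32-bit shift register.
constexpr unsigned bits_packed_per_word(unsigned pin_count) {
    return 32u - (32u % pin_count);
}

// Hypothetical helper: 32-bit words needed to hold one scanline (rounded up).
constexpr unsigned words_per_line(unsigned samples, unsigned pin_count) {
    return (samples * pin_count + bits_packed_per_word(pin_count) - 1u) /
           bits_packed_per_word(pin_count);
}

static_assert(bits_packed_per_word(10) == 30, "three 10-bit samples per word");
static_assert(words_per_line(308, 10) == 103, "308 samples need 103 words");

// Same values as msb_table_inverted in the listing.
constexpr std::uint8_t LEVEL_TABLE[8] = {0, 1, 2, 2, 3, 3, 3, 3};

// Hypothetical helper: turns one FIFO word (three 10-bit samples, oldest in
// the lowest bits, two unused bits at the very bottom) into three output bytes.
inline std::array<std::uint8_t, 3> unpack_word(std::uint32_t word) {
    std::array<std::uint8_t, 3> pixels{};
    std::uint32_t bits = word >> 2; // drop the two unused low bits
    for (std::uint8_t &px : pixels) {
        const std::uint32_t rgb = bits >> 1; // skip the CSYNC bit
        px = static_cast<std::uint8_t>(
            LEVEL_TABLE[rgb & 7u] |
            (LEVEL_TABLE[(rgb >> 3) & 7u] << 3) |
            (LEVEL_TABLE[(rgb >> 6) & 7u] << 6));
        bits >>= 10; // move on to the next sample
    }
    return pixels;
}
```

With 10 pins, each word carries exactly three samples, which is why post_process advances its output index by 3 per word and can skip the left blanking and color burst (24 samples) by starting at word 8.
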
I think my camera doesn’t like this type of scene. It doesn’t look perfect in real life, but not this bad. :p I swear! The noise isn’t there, for example. You can see that the AOC logo is kind of wobbly too, and obviously it isn’t in real life. (I’ll try with a different camera, or without image stabilization next time.)