HP35: A Bit-Serial Wonder – 2. Bit What?

Part2. Bit What?

This article is about the “bit-serial” design scheme employed in HP35’s digital design.

[Status: Proofreading]

(The top image is a part of a dieshot of one HP35 ROM circuit. More specifically the IA buffer section along with some control circuits.)

(The dieshots in the series of articles are from pmonta.com. He is the first person to release a full ROM dump of the HP35: By doing it optically.)

“Bits” of a CPU

In the computer history, engineers are carefully fighting against many restrictions to minimize manufacturing cost. In the 70s, transistors are extremely expansive. Serial architecture, a digital design scheme that’s considered ridiculous today is a popular thing back then.

A typical CPU today is 64bit, they can do 64-bit arithmetic “in one go” and store the intermediate result internally. I think we can argue about what “bits” of a CPU means. This term is not very clear. Most (eg. Wikipedia) says it means “word” size. Or the datapath and main register’s width. But if you take a look at any processor that’s not the simple toy-processor model presented in your book: 64bit Intel processors can also do 128-bit arithmetic; 4004 has a 4-bit ALU but anything else is presented in 8-bit fashion. We call these complicated architecture “mixed architecture”, and call their bit-number in the way that represents the architecture the best.

Our today’s protagonist, HP35, has a 1-bit ALU. Its minimum unit of an arithmetic operation is 4-bit, the maximum unit of an arithmetic operation is 56-bit, the instructions are 10-bit, and the addresses are 8-bit. All mentioned above can be processed in one machine cycle, but a machine cycle of HP35 takes 56 clock cycles. What a mess! How many “bits” are there?

Identically, the HP35 architecture processes a single bit each clock cycle, every databus is just a single line. It should be called a “1-bit architecture”. But how does it work? Why can such a design dramatically save transistor count? Will it be super slow to run? We’ll discover it in this article.

Shift Register Transfer

The way to do parallel register transfer is trivial. If one would describe it in HDL (Don’t worry if you’re not familiar with Verilog HDL, the code is self-explanatory), it’s just:

always @ (posedge clock) reg1 <= reg2;

But how to do register transfer serially?
If you’ve ever tried using SPI bus in a microcontroller system, you should already be familiar with the serial register transfer scheme. (assume each register is 8-bit)

always @ (posedge clock) begin
    reg1 <= {reg1[0], reg1[7:1]};    // Circulate reg1
    reg2 <= {reg1[0], reg2[7:1]};    // Direct reg1's LSB to reg2's entry

transfers the content from reg1 to reg2

always @ (posedge clock) begin
    reg1 <= {reg2[0], reg1[7:1]}; // Direct reg2's LSB to reg1's entry
    reg2 <= {reg1[0], reg2[7:1]};  // Direct reg1's LSB to reg2's entry

exchanges the content of reg1 and reg2

It’s nothing difficult to understand if you have some basic FF design knowledge. We can depict it in this way:

Fig 1. Register Transfer between Two Registers

Notice the “Q”s before and after 8 clocks. Initial content of reg1 is 8’b10010010, reg2 is 8’b01010100

With proper gating signal employed, the designer can do any transfer freely in the register file of a CPU.

Bit by Bit Arithmetic

Bit-by-bit arithmetic sounds cool, it’s nothing hard either. How about computing 5+28 on a piece of paper? It’s just about adding 5 to 8 and add the carry to 20, the process is 5+8=10+3, 20+10=30 thus 5+28 = 33. In this process, only a single decimal digit was operated on each time, and the “carry” was used to chain multiple digits together.

And so-called “bit-serial” arithmetic is just about having a 1 binary bit ALU. We compute bit-by-bit and use the carry bit to chain them together.

To better understand this, we need to recall how binary arithmetic is done in the first place:

Fig 2. A Typical “Full Adder”

With a full adder, one can do 1-bit with-carry arithmetic. In a normal architecture, we chain those adders together to create ALU

Fig 3. Two Parallel 4-Bit Adders

Here’s the typical implementation. The one on the left is simply 4 full adders with their carry in-out chained together. The one on the right is a complex “Carry Look Ahead” structure, which eliminates carry propagation to minimize delay.

Now, how about serial adder? To make a serial adder, we just need to save the current carry as the next cycle’s carry in with a flip-flop. Just feed in data one bit a clock, and then collect the output.

Here’s a serial +1 adder. Since we’re only adding 1, the logic can be even simpler (This, in fact, is just a half-adder)

Fig 4. An Adder Adding 1 to 8’b10010010

The OR gate and the 3-bit counter are used to generate an “initial 1 carry” specifically for the +1 operation. At the first clock (counter=0), the carry bit is forced 1 and soon released to behave like a normal carry bit. That’s the way to implement “reg1 + 1 -> reg1”.

In the same fashion, we can design tons of different operations for our 1-bit ALU. Here’s an annotated code segment from Systemyde’s implementation of the opcode decoder for the ALU of the HP41 NUT processor (the descendant of HP35’s processor).

casex (inst_reg)
    10'b1101110000: alu_out = a_reg[0] || c_reg[0];          // C=C OR A
    10'b1110110000: alu_out = a_reg[0] && c_reg[0];          // C=C AND A
    10'b01001xxx10,                                          // A=A+B
    10'b01100xxx10,                                          // A=A-B
    10'b11001xxx10: alu_out = a_reg[0] ^ b_reg[0] ^ cry_zer; // ?A<B (A-B<0?)
    10'b01x10xxx10,                                          // A=A+C, A=A-C
    10'b100x0xxx10,                                          // C=A+C, C=A-C
    10'b11000xxx10: alu_out = a_reg[0] ^ c_reg[0] ^ cry_zer; // ?A<C (A-C<0?)
    10'b01011xxx10,                                          // A=A+1
    10'b01101xxx10: alu_out = a_reg[0] ^ cry_one;            // A=A-1
    10'b100x1xxx10,                                          // C=C+1, C=C-1
    10'b10101xxx10: alu_out = c_reg[0] ^ cry_one;            // C=-C-1
    10'b10100xxx10: alu_out = c_reg[0] ^ cry_zer;            // C=-C
    10'b01111xxx10: alu_out = cry_zer;                       // C=C+C
    default:      alu_out = 1'b0;

casex (inst_reg)
    10'b01001xxx10: alu_cry = (b_reg[0] && cry_zer) ||
                      ( a_reg[0] && (b_reg[0] || cry_zer));  // A=A+B
    10'b01010xxx10,                                          // A=A+C
    10'b10000xxx10: alu_cry = (c_reg[0] && cry_zer) ||
                      ( a_reg[0] && (c_reg[0] || cry_zer));  // C=C+A
    10'b01011xxx10: alu_cry =  a_reg[0] && cry_one;          // A=A+1
    10'b01100xxx10,                                          // A=A-B
    10'b11001xxx10: alu_cry = (b_reg[0] && cry_zer) ||            
                      (!a_reg[0] && (b_reg[0] || cry_zer));  // ?A<B
    10'b01101xxx10: alu_cry = !a_reg[0] && cry_one;          // A=A-1
    10'b01110xxx10,                                          // A=A-C
    10'b10010xxx10,                                          // C=A-C
    10'b11000xxx10: alu_cry = (c_reg[0] && cry_zer) ||
                      (!a_reg[0] && (c_reg[0] || cry_zer));  // ?A<C (A-C<0?)
    10'b01111xxx10: alu_cry =  c_reg[0];                     // C=C+C
    10'b10001xxx10: alu_cry =  c_reg[0] && cry_one;          // C=C+1
    10'b10011xxx10: alu_cry = !c_reg[0] && cry_one;          // C=C-1
    10'b10111xxx10: alu_cry =  c_reg[0] || cry_zer;          // ?C#0
    10'b10101xxx10: alu_cry =  c_reg[0] || cry_one;          // C=-C-1
    10'b10100xxx10: alu_cry =  c_reg[0] || cry_zer;          // C=-C
    10'b10110xxx10: alu_cry =  b_reg[0] || cry_zer;          // ?B#0
    10'b11010xxx10: alu_cry =  a_reg[0] || cry_zer;          // ?A#0
    10'b11011xxx10: alu_cry = (a_reg[0] ^ c_reg[0]) || cry_zer;// ?A#C
    default:      alu_cry = 1'b0;

cry_zer is the initial 0 carry, cry_one is the initial 1 carry. a_reg, b_reg, c_reg are the arithmetic registers mentioned before.

By looking at the encoding of the opcodes I’m convinced that there’s a cleaner implementation. (eg. each bit is responsible for some specific gating signals. The code segment shown here is a brutal solution that’s made possible by modern technology). But I don’t have the capability to dig further in.

This is just a brief look at bit-serial arithmetic. About the way to implement shifting, BCD adjustment and more is the topic for the ARC article.

Bit-Serial Machine

If we combine the two elements that we’ve depicted above, we’re now able to build a pure bit-serial machine.

Fig 5. Fully Serial, It’s just about shifting.

(The image is from the HP45 patent document)

In each cycle, the CTC shift out the address of the required instruction. After acquiring the address, the ROM chips shift out the corresponding data to CTC and ARC. CTC and ARC then respond to the instruction.

Why is the design the way to go in the 70s? When it comes to transistor count or silicon cost, the bit-serial architecture was unbeatable.

Fig 6. HP35 ARC Dieshot. 

(the Full-size original image is on pmonta.com)

As can be seen from the dieshot, the register file on the top right is very dense.(In the shiny square, the three shorter columns on the left are reg A, B, and C, the longer four on the right are D, E, F, and M) In the design nearly every register is serial. Such a choice has made the structure very simple thus saved wiring space as well as transistor count.

In one machine cycle, activated registers in ARC and CTC will cycle through the internal busses once or twice, allowing other circuits to read them or write to them bit-by-bit.

Fig 7. A closer look. The inlets shown here belongs to reg D, E, F, and M.

Only single lines are required for each register. That’s a key feature of such a design.

Fig 8. The Combined Mask Drawing of Intel4004 (source 4004.com)

Most microchips of that era used parallel databases and high-degree of multiplexing to achieve the same goal. The Intel4004 employed a 4-bit parallel architecture. The 4-bit parallel internal data bus that’s around the chip is obvious (On the metal layer, which is colored blue in this drawing). 4004 is known to be the origin of all modern microprocessors due to its clean, complete yet minimalistic design. The MCS-4 architecture is highly expandable but vastly unsuitable if it’s about pocket calculating.

Why bother doing this?

One obvious benefit of using serial architecture is that the bit-width of signals inside the chips are reduced to the minimum. As can be seen in Fig 6., most of the space on that chip is actually used-up by wiring but not the logic gates themselves. A multi-bit bus can quickly become a routing nightmare when you have only ONE usable metal layer (Plus a resistive polysilicon layer for cross wiring).

But there’s actually a huge benefit that most modern logic designers are not aware of: Serial arithmetic can be implemented using purely dynamic circuit very easily. Here, “dynamic” carries the same meaning as the D in DRAM: those “bits” are stored not in flip-flops (some kind of bi-stable circuit) but in capacitors. Modern DRAMs need to be “refreshed” periodically to retain its data; on those earlier serial logic implementations, however, the action of refreshing is happening on its own, every clock cycle.

With almost all important registers implemented using dynamic logic, HP calculators and a lot of its bit-serial ancestors are able to achieve incredible information density with such primitive technologies.

Fig 9. Delay Line Memory [Calculator Memory Technologies (vintagecalculators.com)]

Believe it or not, the design philosophy of HP35’s dynamic logic is exactly the same as what is called “Delay Line Memories”. Fig 9. shows a digital memory built using a piece of piano wire, where the data is retained in the form of travelling waves. This design philosophy can even be seen in some earlier camcorders, where a piece of crystal is used as a delay line memory to temporary store analog video data.

The Elegance of HP35 System Design

The HP35 has a very long machine cycle: each machine cycle requires 56 clocks. Why it’s not shorter or longer.In my opinion, the HP engineers had tried all they could to squeeze as much as they could into one machine cycle and keep the clock per machine cycle as less as possible. The best part is in the CTC, which will be discussed in a later article. First, let’s take a look at the timing diagram:


  • Is: Instruction fetched from ROM chips. A bump at T11 turns that digit displayed into a sign symbol. I’ll take about the display design in a separate article.
  • Ia: Address given by the CTC chip.
  • WS: Word Select control line
  • SYNC: IS valid and counter reset signal.

There’re some other wires that are not shown in the diagram:

  • KR: Keyboard Row
  • KC: Keyboard Column
  • Data: The display data generated based on register A, B by ARC and received by the Anode Driver chip
  • Start: The digit time 0 indication signal generated by ARC.
  • Carry: The carry flag generated by the ARC, received by CTC for branching.

Phew, that’s dense. All lines shown are just lines, not busses. That’s the first point and was the key point (as discussed in the HP Journal 1972 issue 6): to keep the wiring, internal or external, the minimum. In later models (HP41), the IS and IA are combined into a single “ISA” line.

Secondly, the processor does 2-stage pipelining. Executing and fetching can be done in a single cycle (there’s no data dependency issue here).

Now let’s look into it. We have two timing units here: Digit Time and Cycle Time, each Digit Time equals to 4 Cycle Time. In each Cycle Time, the bit-serial logic processes one bit, In each Digit Time, the ALU in ARC calculates one BCD digit(4-bits). We have in total 14 Digit Times so a 56bit arithmetic operation can be done in one Machine Cycle. Now how to implement the WS feature is obvious: Just ignore the ALU’s output and directly pass thru the input when the digit is not WS’d.

We have an alignment issue here, the original design has a 2-phase clock to eliminate cross-chip clock jitter problem. So which clock do all those signals mentioned aligns to? Based on the data captured on my DSLogic LA, I can give a rough answer (Raw data has been uploaded to my Github ):

        Driver      Set by          Sampled by
Start:  ARC         negedge CPH2    CPH1
SYNC:   CTC         negedge CPH2    ?
IS:     ROM         posedge CPH2    CPH1
IA:     CTC         posedge CPH2    CPH1
WS:     CTC||ROM    masked CPH2     CPH1
Carry:  ARC         negedge CPH2    ?
STEP:   ANODE       the origin of CPH1?

The IA and IS position seems weird. In fact, it was designated based on the bit-serial implementation of the original CTC chip. Those signals are only valid during those time periods to allow many branching features to be implemented. We’ll see it when we get into the CTC chip in the CTC article.

The later models followed the 56-clock machine cycle design. HP engineers attempted to squeeze much more features into the system (eg. RAM access, Magnetic card reading, Storing two return addresses and more) with no extra cycle and little extra wiring required. It has proven that the non-standard architecture is both feature-packed and extendable.

2 thoughts on “HP35: A Bit-Serial Wonder – 2. Bit What?”

Leave a Reply