8 minute read

Here’s the question I used to use at Invitae to interview bioinformatics engineering candidates, how it’s solved by two reasoning LLMs, and a few conclusions.

Interview Question

I have a fasta file that has the following format:

>SEQ1
AGTCTGCTACGCTAGC
>SEQ2
GGTGCGATCGATCGTG
>SEQ3
TTCCCATGCTACGTCG
>SEQ4
TCAAACGCATCGCGAC

You can assume that:

  • the fasta file is well formatted
  • there is only one sequence line per header line
  • each sequence in the file is the same length

Your task is to replace the pass portions of this code with functioning code, according to the name and docstring:

def read_fasta(input_fasta_path):
    """Given the location of a fasta file, return a list of sequences"""
    pass


def are_similar(seq1, seq2, threshold=3):
    "Given two sequences, using a Hamming distance metric to return a bool indicating whether or not the two sequences are similar"
    pass


def build_graph(sequences):
    """Given a list of sequences, return a graph representation of sequences that are similar via Hamming distance"""
    pass


def are_second_degree_neighbors(graph, seq1, seq2):
    """Given a graph and two sequences, return a bool indicating if they're second degree neighbors""" 
    pass


def main():
   pass

Please create any tests and data you need in order to feel confident that the code you write is correct. Similarly, structure main in any way you see fit to demonstrate correctness of the code.

Lastly, imagine we want this code to go into production. Please attempt to make coding decisions that produce production-level code.

ChatGPT-o1

(Thought for 29s)

#!/usr/bin/env python3
"""
FASTA Similarity Checker

This module provides functionality to:
1. Read sequences from a FASTA file.
2. Compute a Hamming-based similarity check between two sequences.
3. Build a graph (adjacency list) of sequences considered "similar."
4. Determine whether two sequences are second-degree neighbors within that graph.

Author: Your Name
Date: 2025-01-31
"""

from itertools import combinations
import sys


def hamming_distance(seq1: str, seq2: str) -> int:
    """
    Compute the Hamming distance between two equal-length sequences.
    
    :param seq1: First sequence (string).
    :param seq2: Second sequence (string).
    :return: The Hamming distance (integer).
    :raises ValueError: If the sequences are not the same length.
    """
    if len(seq1) != len(seq2):
        raise ValueError("Sequences must be of equal length to compute Hamming distance.")
    return sum(ch1 != ch2 for ch1, ch2 in zip(seq1, seq2))


def read_fasta(input_fasta_path: str) -> list[str]:
    """
    Given the location of a fasta file, return a list of sequences.
    Assumptions:
      - The FASTA file is well-formatted.
      - There is only one sequence line per header line.
      - Each sequence is of the same length.
    
    :param input_fasta_path: Path to input FASTA file.
    :return: A list of sequences (strings).
    """
    sequences = []
    
    with open(input_fasta_path, "r") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            # Skip headers that start with '>'
            if line.startswith(">"):
                continue
            sequences.append(line)

    return sequences


def are_similar(seq1: str, seq2: str, threshold: int = 3) -> bool:
    """
    Given two sequences, use a Hamming distance metric to return a bool
    indicating whether the two sequences are 'similar.'
    
    :param seq1: First sequence (string).
    :param seq2: Second sequence (string).
    :param threshold: Maximum Hamming distance for which sequences are considered similar.
    :return: True if Hamming distance <= threshold, False otherwise.
    """
    dist = hamming_distance(seq1, seq2)
    return dist <= threshold


def build_graph(sequences: list[str], threshold: int = 3) -> dict[str, set[str]]:
    """
    Given a list of sequences, return a graph (adjacency list) of sequences
    that are similar with respect to a given Hamming distance threshold.
    
    :param sequences: A list of sequences (strings).
    :param threshold: Maximum Hamming distance allowed to declare two sequences connected.
    :return: A dict where keys are sequences and values are sets of adjacent (similar) sequences.
    """
    graph = {seq: set() for seq in sequences}
    
    for seq1, seq2 in combinations(sequences, 2):
        if are_similar(seq1, seq2, threshold):
            graph[seq1].add(seq2)
            graph[seq2].add(seq1)

    return graph


def are_second_degree_neighbors(graph: dict[str, set[str]], seq1: str, seq2: str) -> bool:
    """
    Given a graph (as returned by build_graph) and two sequences, return a bool
    indicating if they are second-degree neighbors.
    
    Here, "second-degree neighbors" means that seq1 is not directly connected to seq2,
    but there exists some intermediate neighbor 'n' such that seq1 -> n -> seq2 is a path.
    
    :param graph: The adjacency list (dict of sets) of sequences.
    :param seq1: The first sequence (string).
    :param seq2: The second sequence (string).
    :return: True if seq2 can be reached from seq1 by exactly two "hops." 
    """
    # If either sequence is not in the graph, cannot be neighbors
    if seq1 not in graph or seq2 not in graph:
        return False
    
    # If they're first-degree neighbors, they're not strictly second-degree neighbors
    if seq2 in graph[seq1]:
        return False

    # Check for an intermediate neighbor
    for neighbor in graph[seq1]:
        if seq2 in graph[neighbor]:
            return True
    
    return False


def test_all():
    """
    A simple test suite to ensure correctness of the code.
    """
    # Test Hamming distance
    assert hamming_distance("AAAA", "AAAT") == 1
    assert hamming_distance("AAAA", "AAAA") == 0
    try:
        _ = hamming_distance("AAA", "AAAA")
    except ValueError:
        pass  # expected

    # Test are_similar
    assert are_similar("AAAA", "AAAT", threshold=1) is True
    assert are_similar("AAAA", "AATT", threshold=1) is False
    
    # Build a small graph and test adjacency
    seqs = ["AAAA", "AAAT", "AATT", "TTTT"]
    g = build_graph(seqs, threshold=1)
    # neighbors of "AAAA" should only be "AAAT"
    assert g["AAAA"] == {"AAAT"}
    # neighbors of "AAAT" should be "AAAA" and "AATT"
    assert g["AAAT"] == {"AAAA", "AATT"}
    
    # Test are_second_degree_neighbors
    # "AAAA" -> "AAAT" -> "AATT" => second-degree neighbors
    assert are_second_degree_neighbors(g, "AAAA", "AATT") is True
    # But "AAAA" and "TTTT" are not even close (distance=4 each from "AAAA")
    assert are_second_degree_neighbors(g, "AAAA", "TTTT") is False
    
    print("All tests passed!")


def main():
    """
    Demonstration of how to use the above functions in practice.
    """
    # 1. Run tests to ensure code works correctly
    test_all()
    
    # 2. Example usage with a sample FASTA file (replace 'example.fasta' with a real file path)
    fasta_file = "example.fasta"

    try:
        sequences = read_fasta(fasta_file)
    except FileNotFoundError:
        print(f"Error: FASTA file '{fasta_file}' not found.")
        sys.exit(1)
    
    print("Sequences read from FASTA file:")
    for s in sequences:
        print(s)
    
    # 3. Build the similarity graph
    graph = build_graph(sequences, threshold=3)
    
    # 4. Demonstrate neighbor checks (customize seq1, seq2 to your input)
    if len(sequences) >= 2:
        seq1 = sequences[0]
        seq2 = sequences[1]
        print(f"\nChecking if '{seq1}' and '{seq2}' are similar (threshold=3): {are_similar(seq1, seq2, 3)}")
        print(f"Are '{seq1}' and '{seq2}' second-degree neighbors? {are_second_degree_neighbors(graph, seq1, seq2)}")
    else:
        print("Not enough sequences to demonstrate neighbor checks.")


if __name__ == "__main__":
    main()

Running the script

I used the following test file as example.fasta:

>SEQ1
AAAAAAATTTTTTTCCCCCCGGGGGGG
>SEQ2
AAAAAAATTTTTTTCCCCCCGGGTTTT
>SEQ3
GGGGGGGTTTTTTTCCCCCCGGGGGGG
>SEQ4
AAAAAAATTTTTTTCCCCCCGGGGGTT
>SEQ5
AAAAAAATTTTTTTCCCCCCGGGGGGC

I’d expect the o1 script to say that the first two sequences are neighbors.

❯ python chatGPT_o1.py
All tests passed!
Sequences read from FASTA file:
AAAAAAATTTTTTTCCCCCCGGGGGGG
AAAAAAATTTTTTTCCCCCCGGGTTTT
GGGGGGGTTTTTTTCCCCCCGGGGGGG
AAAAAAATTTTTTTCCCCCCGGGGGTT
AAAAAAATTTTTTTCCCCCCGGGGGGC

Checking if 'AAAAAAATTTTTTTCCCCCCGGGGGGG' and 'AAAAAAATTTTTTTCCCCCCGGGTTTT' are similar (threshold=3): False
Are 'AAAAAAATTTTTTTCCCCCCGGGGGGG' and 'AAAAAAATTTTTTTCCCCCCGGGTTTT' second-degree neighbors? True

Deepseek R1

(Thought for 176s)

from collections import defaultdict


def read_fasta(input_fasta_path):
    """Given the location of a fasta file, return a list of sequences"""
    sequences = []
    with open(input_fasta_path, "r") as f:
        for line in f:
            if line.startswith(">"):
                # Read the next line as the sequence
                seq = next(f).strip()
                sequences.append(seq)
    return sequences


def are_similar(seq1, seq2, threshold=3):
    """Given two sequences, return True if their Hamming distance is <= threshold."""
    if len(seq1) != len(seq2):
        return False
    hamming_distance = sum(c1 != c2 for c1, c2 in zip(seq1, seq2))
    return hamming_distance <= threshold


def build_graph(sequences):
    """Given a list of sequences, return a graph where keys are sequences and values are lists of similar sequences."""
    graph = defaultdict(list)
    n = len(sequences)
    for i in range(n):
        for j in range(i + 1, n):
            if are_similar(sequences[i], sequences[j]):
                graph[sequences[i]].append(sequences[j])
                graph[sequences[j]].append(sequences[i])
    return graph


def are_second_degree_neighbors(graph, seq1, seq2):
    """Return True if seq1 and seq2 are connected via exactly two edges in the graph."""
    if seq1 == seq2:
        return False  # A sequence cannot be a neighbor of itself in this context

    # Check if they are direct neighbors
    if seq2 in graph.get(seq1, []):
        return False

    # Check if there's any common neighbor that connects them
    for neighbor in graph.get(seq1, []):
        if seq2 in graph.get(neighbor, []):
            return True
    return False


def main():
    # Example test case with a small FASTA file
    test_fasta = [">SEQ1", "AAAA", ">SEQ2", "AAAT", ">SEQ3", "AATT", ">SEQ4", "ATTT"]
    # Writing test data to a temporary file
    with open("test.fasta", "w") as f:
        f.write("\n".join(test_fasta))

    # Read the sequences
    sequences = read_fasta("test.fasta")
    print(f"Read {len(sequences)} sequences: {sequences}")

    # Build the graph
    graph = build_graph(sequences)
    print("Graph edges:")
    for seq, neighbors in graph.items():
        print(f"{seq}: {neighbors}")

    # Test are_similar
    print("\nTesting are_similar:")
    print("SEQ1 and SEQ2 (threshold=1):", are_similar(sequences[0], sequences[1], 1))  # True
    print("SEQ1 and SEQ3 (threshold=1):", are_similar(sequences[0], sequences[2], 1))  # False

    # Test are_second_degree_neighbors
    print("\nTesting are_second_degree_neighbors:")
    print("SEQ1 and SEQ3:", are_second_degree_neighbors(graph, sequences[0], sequences[2]))  # True
    print("SEQ1 and SEQ4:", are_second_degree_neighbors(graph, sequences[0], sequences[3]))  # False


if __name__ == "__main__":
    main()

It’s clever that it chooses to write out then read in a file in main.

If we just run the script, we get the output:

❯ python deepseek_R1.py
Read 4 sequences: ['AAAA', 'AAAT', 'AATT', 'ATTT']
Graph edges:
AAAA: ['AAAT', 'AATT', 'ATTT']
AAAT: ['AAAA', 'AATT', 'ATTT']
AATT: ['AAAA', 'AAAT', 'ATTT']
ATTT: ['AAAA', 'AAAT', 'AATT']

Testing are_similar:
SEQ1 and SEQ2 (threshold=1): True
SEQ1 and SEQ3 (threshold=1): False

Testing are_second_degree_neighbors:
SEQ1 and SEQ3: False
SEQ1 and SEQ4: False

Conclusions

  • Both of these models pass the test. By comparison, probably 1/3 people I interviewed would pass. Many bioinformaticians get bogged down in even simple data structure / algorithm style questions.
  • There’s a little room for ambiquity in the question. I used to spend a lot of time in the interviews answering questions and clarifying that ambiquity. Both models make reasonable choices.
  • I like to see both itertools and collections utilized, as a demonstration of knowledge of the standard library. It’s interesting that each model uses one but not the other.
  • o1’s tests were a little more robust, IMO.
  • R1 took 6x the time to solve the problem.
  • I have a slight qualitative preference for o1’s code.

// TODO Comment on the set approach for 2nd degree neighbors

And finally:

  • The next time I interview a BFX role, I should come up with another question. This one is “solved”.

Tags:

Categories:

Updated: