OpenMS Invites the Computational Mass Spectrometry Community to Join Google Summer of Code 2025!

OpenMS is planning to apply to Google Summer of Code (GSoC) 2025 as an umbrella organization, and we would like to invite other projects and groups within the computational mass spectrometry and proteomics/metabolomics communities to join us in this effort. If your project aligns with the goals of GSoC and you are interested in mentoring a student project, we encourage you to submit a short proposal by February 7th, 2025 at 23:59 UTC.

Note: Current status: We are no longer accepting project proposals from mentors.

GSoC Contributors

Make sure you are eligible to participate in GSoC 2025.
Keep the GSoC 2025 timeline in mind.
Read the DOs and DON’Ts document to gauge your interest in participating in GSoC 2025.
Review the list of themes and the projects available within. If you have specific questions about a project, our mentors are active on Discord and we will happily assist you.
Follow our instructions below on how to submit a proposal to us.

Submitting an Application:

Proposals must be uploaded to the GSoC webpage before the official deadline. Ensure your CV and contact information are included in the proposal document.
We highly recommend getting in touch with the mentors before submitting your proposal.

Available Projects

Theme A) Visualization and User-Friendly Tools

1) Peptide m/z Calculator

Proposed Mentors: Arslan Siraj, Tom Müller
Skills: Python, Git, Streamlit
Estimated Project Length: 90 hours | Difficulty: Easy

In mass spectrometry (MS) based proteomics and metabolomics, a common task is to compute the mass-to-charge (m/z) ratio of an analyte so that it can be located in a spectrum. Although the calculation is simple and easily performed with the pyopenms package, it can be cumbersome for wet-lab scientists with little programming experience. In this project, the student will use the new OpenMS-WebApps template to create a simple GUI that lets researchers perform this calculation. The calculation can be performed using pyopenms as demonstrated here. Furthermore, the resulting web application can be extended to other MS calculation tasks (e.g. theoretical spectrum generation) for quick interpretation of MS data.
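As a rough illustration of the arithmetic behind such a calculator, the sketch below computes the m/z of an unmodified peptide from monoisotopic residue masses using only the Python standard library. The function name `peptide_mz` and the constant names are illustrative assumptions, not part of pyopenms or the OpenMS-WebApps template; a real app would most likely delegate this step to pyopenms and also handle modifications.

```python
# Monoisotopic residue masses (Da) of the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER_MASS = 18.010565   # one H2O is added for the peptide termini
PROTON_MASS = 1.007276   # each positive charge adds one proton


def peptide_mz(sequence: str, charge: int) -> float:
    """Return the monoisotopic m/z of an unmodified peptide (hypothetical helper)."""
    if charge < 1:
        raise ValueError("charge must be a positive integer")
    try:
        residue_sum = sum(RESIDUE_MASS[aa] for aa in sequence.upper())
    except KeyError as err:
        raise ValueError(f"unknown residue: {err.args[0]}") from None
    neutral_mass = residue_sum + WATER_MASS
    # m/z = (M + z * m_proton) / z
    return (neutral_mass + charge * PROTON_MASS) / charge


for z in (1, 2, 3):
    print(f"PEPTIDE, charge {z}+: m/z = {peptide_mz('PEPTIDE', z):.4f}")
```

In a Streamlit app, the sequence and charge would come from input widgets instead of being hard-coded.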

2) Write a generic visualization app for mzQC

Proposed Mentors: Chris Bielow, Arslan Siraj
Skills: Visualization, Controlled Vocabularies, Python|R
Estimated Project Length: 175 hours | Difficulty: Easy to Medium

Adoption and public exposure of quality control in mass spectrometry (MS) have gained increasing traction in recent years. The Proteomics Standards Initiative (PSI) has developed an open exchange format named mzQC, which aims to foster capturing, exchanging and archiving quality control related data across all MS-based OMICS, such as proteomics, metabolomics and lipidomics. Currently, there exists no package which is capable of visualizing and summarizing the content of any given mzQC file (e.g. as obtained from a publication’s supplemental material).

Tasks:

Pick a visualization framework of your choice (e.g. Streamlit or R Shiny) and write code (Python or R) to allow a user to explore the content of a given (uploaded) mzQC file.
Visualization could be a textual summary as well as (interactive) plots for the QC data contained within the mzQC file. Depending on each metric's properties, a suitable plot type should be chosen automatically.
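One possible starting point for the automatic choice of display type is to inspect the shape of each metric's value. The sketch below parses an mzQC document with the standard `json` module and maps single values to text, lists to a histogram, and column dictionaries to a table. The helper names `choose_view` and `summarize`, the mapping itself, and the `EX:` accessions in the example are assumptions made for illustration; the field names follow the mzQC JSON layout as we understand it and should be checked against the specification.

```python
import json


def choose_view(value):
    """Pick a display type from the shape of a metric value (illustrative rule)."""
    if isinstance(value, list):
        return "histogram"   # n-tuple of values
    if isinstance(value, dict):
        return "table"       # named columns of equal length
    return "text"            # single number, string, or missing value


def summarize(mzqc_text):
    """Return (run label, accession, name, view) for every metric in the file."""
    doc = json.loads(mzqc_text)["mzQC"]
    rows = []
    for group in ("runQualities", "setQualities"):
        for quality in doc.get(group, []):
            label = quality.get("metadata", {}).get("label", "")
            for metric in quality.get("qualityMetrics", []):
                rows.append((
                    label,
                    metric.get("accession"),
                    metric.get("name"),
                    choose_view(metric.get("value")),
                ))
    return rows


# Made-up example content; the accessions are placeholders, not real CV terms.
EXAMPLE = json.dumps({"mzQC": {"version": "1.0.0", "runQualities": [{
    "metadata": {"label": "run_01"},
    "qualityMetrics": [
        {"accession": "EX:1", "name": "scalar metric", "value": 1234},
        {"accession": "EX:2", "name": "tuple metric", "value": [0.1, 0.5, 0.9]},
        {"accession": "EX:3", "name": "table metric",
         "value": {"rt": [1.0, 2.0], "count": [10, 12]}},
    ],
}]}})

for row in summarize(EXAMPLE):
    print(row)
```

A full app would additionally use the controlled vocabulary entry behind each accession to pick units and axis labels.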

3) Improving PyOpenMS Tool Parameter Accessibility

Proposed Mentors: Tom Müller, Joshua Charkow
Skills: Python, Cython, Git
Estimated Project Length: 175 hours | Difficulty: Medium

PyOpenMS provides Python bindings for OpenMS, a powerful open-source C++ library for computational mass spectrometry. These bindings are automatically generated using Cython and the autowrap package. While C++ developers prioritize performance and fine-grained control, Python developers emphasize readability and ease of use. Simply exposing a C++ API to Python often results in an interface that feels unnatural to Python users. One area where this is particularly evident in PyOpenMS is the handling of parameters in OpenMS algorithms. Currently, retrieving and modifying parameters requires multiple steps: instantiating an object from an algorithm class, calling a method to obtain parameters, modifying them through a dedicated ‘Param’ class, and applying them via a setter method. This workflow is complex and does not align with Pythonic conventions. The task is to improve the usability of PyOpenMS by enabling parameter setting directly via keyword arguments when instantiating objects of an algorithm class and making parameters easily accessible through the help() function for interactive exploration.

Tasks:

Modify the Cython binding for function type checking and conversion.
Add Pythonic helper functions for type checking and conversion.
Port the C++ tool documentation to Python.
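To make the intended usability gain concrete, the sketch below wraps a class that follows the get-parameters / modify / set-parameters workflow so that parameters can be passed as keyword arguments and are listed in the docstring shown by help(). `StubAlgorithm` is a stand-in written for this example and is not a pyOpenMS class, its parameters are invented, and a plain dict takes the place of the ‘Param’ class. The decorator `with_kwargs` is likewise hypothetical and only indicates one direction the project could take.

```python
class StubAlgorithm:
    """Stand-in for a wrapped algorithm class (not part of pyOpenMS)."""

    def __init__(self):
        # Invented defaults; a real algorithm would expose a Param object.
        self._params = {"signal_to_noise": 1.0, "peak_width": 0.15}

    def getParameters(self):
        return dict(self._params)

    def setParameters(self, params):
        self._params = dict(params)


def with_kwargs(algorithm_cls):
    """Return a subclass that accepts algorithm parameters as keyword arguments."""
    defaults = algorithm_cls().getParameters()

    class Wrapped(algorithm_cls):
        def __init__(self, **kwargs):
            super().__init__()
            params = self.getParameters()
            for name, value in kwargs.items():
                if name not in params:
                    raise TypeError(f"unknown parameter: {name!r}")
                expected = type(params[name])
                if not isinstance(value, expected):
                    value = expected(value)  # e.g. int -> float conversion
                params[name] = value
            self.setParameters(params)

    Wrapped.__name__ = algorithm_cls.__name__
    # Expose the parameters and their defaults to help().
    Wrapped.__doc__ = "Parameters (defaults):\n" + "\n".join(
        f"    {name} = {value!r}" for name, value in defaults.items()
    )
    return Wrapped


Tuned = with_kwargs(StubAlgorithm)
algorithm = Tuned(peak_width=1)
print(algorithm.getParameters())
print(Tuned.__doc__)
```

Doing this at the binding-generation level rather than per class would avoid having to wrap each algorithm by hand.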

Theme B) Data Formats and Interoperability

1) Pythonic mzML handling

Proposed Mentors: Joshua Charkow, Tom Müller
Skills: Python, Git
Estimated Project Length: 175 hours | Difficulty: Easy

Liquid chromatography-mass spectrometry (LC-MS) data is commonly stored as sparse multidimensional data containing billions of peaks and can easily exceed a few gigabytes when compressed on disk. Python is commonly used in the field for exploratory analysis and development of novel data processing algorithms. The most common format for storing mass spectrometry data is .mzML, an open, XML-based format. Although numerous libraries exist for parsing .mzML files in Python, each library has its own interface, which might not be intuitive for data scientists just starting with mass spectrometry. Recently, the alphatims package was introduced, which stores mass spectrometry data in a pandas DataFrame-like structure, making it quite accessible for data manipulation and exploration. Notably, this format takes advantage of Python’s slicing syntax, making it intuitive for data scientists. However, this package only supports “.d” files, which are vendor-specific and not applicable to the broader mass spectrometry community. In this project, the student will create a “pythonic” .mzML file reader inspired by alphatims by extending the currently available .mzML Python parsers.

Tasks:

Leverage the pyopenms documentation to get familiar with pyopenms .mzML file reading.
Create a class which allows for slicing of the .mzML file across various dimensions and returns a DataFrame object.
Benchmark this class on various .mzML files for read times and memory usage.
Create a Python package and release it on PyPI.
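The slicing idea can be sketched without any MS library: the class below keeps peaks sorted by retention time and accepts a pair of slices, one for retention time and one for m/z. It is a standard-library stand-in only. `PeakTable` is a hypothetical name, the peaks are invented, results are returned as a list of tuples rather than a DataFrame, and reading from .mzML is left out entirely.

```python
from bisect import bisect_left, bisect_right


class PeakTable:
    """Peaks as (rt_seconds, mz, intensity) tuples, sliceable by value."""

    def __init__(self, peaks):
        self._peaks = sorted(peaks)               # sorted by rt, then mz
        self._rts = [peak[0] for peak in self._peaks]

    def __getitem__(self, key):
        rt_slice, mz_slice = key
        # Bounds are physical values and inclusive on both ends,
        # unlike Python's usual half-open positional slicing.
        first = 0 if rt_slice.start is None else bisect_left(self._rts, rt_slice.start)
        last = len(self._rts) if rt_slice.stop is None else bisect_right(self._rts, rt_slice.stop)
        mz_low = float("-inf") if mz_slice.start is None else mz_slice.start
        mz_high = float("inf") if mz_slice.stop is None else mz_slice.stop
        return [p for p in self._peaks[first:last] if mz_low <= p[1] <= mz_high]


# Invented example peaks.
table = PeakTable([
    (5.0, 300.1, 10.0),
    (12.0, 450.2, 50.0),
    (12.0, 650.3, 20.0),
    (18.5, 499.9, 5.0),
    (25.0, 420.0, 7.0),
])

print(table[10.0:20.0, 400.0:500.0])   # peaks with 10 s <= rt <= 20 s and 400 <= m/z <= 500
print(table[:, 600.0:])                # all retention times, m/z >= 600
```

The binary search over retention time keeps the lookup cheap for sorted data; a real reader would add further dimensions such as MS level or ion mobility.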

2) Write a C++ library to read/write mzQC

Proposed Mentors: Chris Bielow, Nils Hoffmann
Skills: C++, Controlled Vocabularies, JSON, CMake, GitHub CI
Estimated Project Length: 350 hours | Difficulty: Easy to Medium

Adoption and public exposure of quality control in mass spectrometry (MS) have gained increasing traction in recent years. The Proteomics Standards Initiative (PSI) has developed an open exchange format named mzQC, which aims to foster capturing, exchanging and archiving quality control related data across all MS-based OMICS, such as proteomics, metabolomics and lipidomics. Currently, there exist core libraries to read and write mzQC in Python, R, and Java. See MS-Quality-Hub.

Tasks:

Implement a new mzQC core library in C++ which supports reading/writing of mzQC.
Publish the library on GitHub under a permissive license (BSD-3-Clause) as a subproject of MS-Quality-Hub.
Write class/unit tests and run them using GitHub Actions.
Integrate your library into OpenMS (incl. adaptation of the build system to include your library) and substitute existing code to create an mzQC file.

3) Integrate Apache Parquet into OpenMS Build System

Proposed Mentors: Timo Sachsenberg, Samuel Wein
Skills: CMake, GitHub CI, C++, Python
Estimated Project Length: 350 hours | Difficulty: Medium

Proteomics and metabolomics mass spectrometry studies are generating datasets of unprecedented size as they scale to include more and more samples. Managing and processing these large datasets efficiently requires robust and scalable data handling solutions to make results readily available for downstream processing tasks like machine learning.

The task is to integrate Apache Parquet, a columnar storage format, into OpenMS as a fundamental step toward enabling faster data processing, reducing memory usage, and improving scalability. The integration will involve:

Updating the OpenMS build system with new CMake configurations.
Developing comprehensive tests to validate functionality and performance.
Adapting CI/CD pipelines for macOS, Linux, and Windows to ensure cross-platform compatibility.

4) Universal Mass Spectrometry Data Processing in Python

Proposed Mentors: Wout Bittremieux, Janne Heirman
Skills: Python, R, GitHub CI
Estimated Project Length: 350 hours | Difficulty: Medium

A major challenge in mass spectrometry (MS) data analysis is the lack of interoperability between different open-source software tools. While various data processing packages in Python exist, many of these suffer from development inefficiencies due to having to implement duplicate functionality.

This project aims to address these inefficiencies by integrating leading open-source MS processing and visualization tools, such as spectrum_utils, pyOpenMS, Pyteomics, and matchms. The goal is to seamlessly connect these tools by harnessing their similar internal MS data representations, allowing users to easily move between them without redundant re-implementation of core functionality.

To achieve this, we will develop translation layers between different MS tools, enabling smooth data exchange both within and across programming languages; optimize existing algorithms to handle the ever-growing size of MS datasets efficiently, ensuring faster and more scalable data processing; and create a unified workflow that makes MS analysis more intuitive, accessible, and powerful for researchers worldwide.

By building these bridges, this project will empower scientists to focus on discoveries rather than data format headaches, fostering collaboration and innovation across the mass spectrometry community.

Theme C) Machine Learning and Advanced Computational Methods

1) Diffusion Deconvolution of DIA-MS/MS Data (D^4 | dquartic)

Proposed Mentors: Justin Sing, Leon Xu
Skills: Python, PyTorch, Deep Learning
Estimated Project Length: 350 hours | Difficulty: Medium to Advanced

Diffusion models have revolutionized generative AI, excelling in tasks like image enhancement and speech separation. This project applies similar principles to Data-Independent Acquisition Mass Spectrometry (DIA-MS)—a technique that captures complex, overlapping signals requiring deconvolution. This project offers an exciting opportunity to apply deep learning & generative AI techniques to a real-world bioinformatics challenge. If you enjoy working on AI models for signal processing, deconvolution, and scientific applications, we’d love to have you on board!

Problem: DIA-MS produces two correlated data types:

MS1: A low-resolution “overview” (analogous to a blurry image or background noise in audio).
MS2: Detailed but highly multiplexed signals (akin to overlapping voices in an audio recording).

The goal is to train a diffusion model to deconvolve and denoise MS2 signals, using MS1 as a guiding signal—similar to how Stable Diffusion refines blurry images or Whisper (OpenAI) separates speech from noise.
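For readers new to diffusion models, the sketch below shows the forward (noising) step that such a model learns to reverse: a clean signal x_0 is mixed with Gaussian noise as x_t = sqrt(ᾱ_t)·x_0 + sqrt(1 − ᾱ_t)·ε. It assumes a standard DDPM-style linear noise schedule with commonly used default values; the project description does not say which schedule or parameterization the existing baseline uses, the intensity values are invented, and the MS1 conditioning is not shown.

```python
import math
import random


def alpha_bar_schedule(steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of (1 - beta_t) for a linear beta schedule (steps >= 2)."""
    alpha_bars = []
    running = 1.0
    for t in range(steps):
        beta = beta_start + (beta_end - beta_start) * t / (steps - 1)
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def add_noise(signal, alpha_bar, rng):
    """Sample x_t = sqrt(alpha_bar) * x_0 + sqrt(1 - alpha_bar) * eps."""
    keep = math.sqrt(alpha_bar)
    spread = math.sqrt(1.0 - alpha_bar)
    return [keep * x + spread * rng.gauss(0.0, 1.0) for x in signal]


rng = random.Random(0)
schedule = alpha_bar_schedule(1000)
clean = [0.0, 0.2, 1.0, 0.3, 0.0]          # invented, normalized intensity trace

slightly_noisy = add_noise(clean, schedule[10], rng)
almost_pure_noise = add_noise(clean, schedule[-1], rng)
print(slightly_noisy)
print(almost_pure_noise)
```

During training, a network is shown the noisy signal together with the step index (and, in this project, MS1-derived conditioning) and learns to recover the clean signal or the added noise.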

Current Progress & Next Steps:

A baseline U-Net-based diffusion model already shows promising results on synthetic mixtures of DIA-MS data. The next phase is to:

a. Train the model on raw DIA-MS data for direct peptide signal separation.
b. Integrate MS1 and MS2 peptide feature masks (from OpenSwath, DIA-NN, or Spectronaut) as conditioning signals, similar to how segmentation masks guide image super-resolution.
c. Investigate pseudo-DDA spectra generation (e.g., diaUMPIRE, diaTracer) by incorporating intermediate deconvolution steps.

Tasks:

Optimize Data Loading: Implement efficient pipelines for handling large-scale DIA-MS data.
Improve Memory Efficiency: Apply quantization to reduce model footprint, similar to vision models.
Enhance Conditioning Signals: Refine how the model extracts targeted peptide signals from MS1/MS2 data.
Explore Alternative Architectures: Test transformer-based backbones (e.g., ViTs) for potential performance gains.

2) Optimizing Casanovo for Fast and Accurate De Novo Peptide Sequencing

Proposed Mentors: William Stafford Noble, Wout Bittremieux
Skills: Python, PyTorch, deep learning, profiling
Estimated Project Length: 350 hours | Difficulty: Advanced

De novo peptide sequencing is a powerful approach for molecular discovery by deciphering peptides directly from tandem mass spectrometry (MS/MS) data. Casanovo is a state-of-the-art AI tool that treats de novo sequencing like a language translation problem—converting sequences of peaks in an MS/MS spectrum into amino acid sequences, just like translating one language into another. Powered by a transformer deep neural network, Casanovo has already revolutionized peptide sequencing, but its inference speed remains a bottleneck, particularly during beam search decoding, the step responsible for selecting the best peptide candidates.

By making Casanovo faster and more efficient, we can unlock new biological insights at an unprecedented scale. Whether it’s identifying unknown proteins, uncovering disease biomarkers, or advancing drug discovery, your contributions will have a real-world impact on science and healthcare. If you love machine learning, performance optimization, and AI-driven discovery, this project is your chance to make a difference in computational biology!

This project aims to optimize Casanovo’s speed, enabling researchers to process larger datasets, make new discoveries faster, and push the boundaries of proteomics research. You will:

a. Profile performance bottlenecks in Casanovo’s inference pipeline to pinpoint slowdowns.
b. Optimize beam search decoding and other key computations to improve runtime efficiency.
c. Enhance scalability, ensuring Casanovo can handle the growing demands of big-data proteomics.
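For contributors unfamiliar with beam search, the toy sketch below shows the general technique: at each step every kept prefix is extended by every token, and only the `beam_width` highest-scoring candidates survive. It is not Casanovo's implementation. The two-token vocabulary, the probabilities in `toy_model`, and the function names are invented for illustration, and a real decoder would score amino acid tokens with the transformer and batch the work on a GPU.

```python
import math


def toy_model(prefix):
    """Return log-probabilities of the next token given the prefix (invented numbers)."""
    table = {
        (): {"A": 0.6, "B": 0.4},
        ("A",): {"A": 0.5, "B": 0.5},
        ("B",): {"A": 0.9, "B": 0.1},
    }
    return {token: math.log(p) for token, p in table[prefix].items()}


def beam_search(next_log_probs, length, beam_width):
    """Return the best `beam_width` sequences of the given length with their scores."""
    beams = [((), 0.0)]  # (prefix, cumulative log-probability)
    for _ in range(length):
        candidates = [
            (prefix + (token,), score + log_p)
            for prefix, score in beams
            for token, log_p in next_log_probs(prefix).items()
        ]
        # Keep only the highest-scoring candidates for the next step.
        candidates.sort(key=lambda item: item[1], reverse=True)
        beams = candidates[:beam_width]
    return beams


greedy = beam_search(toy_model, length=2, beam_width=1)[0]
wider = beam_search(toy_model, length=2, beam_width=2)[0]
print("beam width 1:", greedy[0], round(math.exp(greedy[1]), 2))
print("beam width 2:", wider[0], round(math.exp(wider[1]), 2))
```

With a beam width of 1 the search commits to the locally best first token and ends at probability 0.3, while a width of 2 finds the sequence B, A with probability 0.36. The cost grows with beam width times vocabulary size per step, which is why this stage is a natural target for profiling and optimization.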