Data mining, in mass spectrometry, largely entails the relentless scrutiny of the mass spectra by an expert eye. Without a powerful mass spectrum viewer, capable of numerous data display modes, the expert eye remains powerless.
After having completed this chapter you will be able to perform mass spectrum visualization and analysis, optionally reporting all the analysed peaks to a file on disk.
To start a mineXpert session, open one or more mass spectrum files using the file-opening items of the main menu.
mineXpert understands the mzML format and the file loading procedures are delegated to the excellent libpwiz library from the ProteoWizard project[6]. Simple txt,asc data where m/z and i values are separated by any character that is neither a newline nor a dot nor a digit (loading is handled by a private parser) can be loaded either from file or directly from the clipboard. A third format is SQLite3, a private open/documented database format (private parser) that is used in mineXpert to allow slicing very big datasets into smaller chunks. Incidentally, the SQLite3 format allows for faster data loads.
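The separator rule stated above for txt,asc data lends itself to a small regular expression. The following is a minimal sketch in Python, not mineXpert's private parser: the name `parse_xy_text` is hypothetical, and the sketch assumes one pair per line and unsigned decimal numbers without exponent notation.

```python
import re

# One (m/z, i) pair per line: two unsigned decimal numbers separated by
# one or more characters that are neither a newline, a dot nor a digit.
NUMBER = r"(\d+(?:\.\d*)?|\.\d+)"
PAIR_RE = re.compile(r"^\s*" + NUMBER + r"[^\d.\n]+" + NUMBER + r"\s*$")

def parse_xy_text(text):
    """Return the list of (mz, intensity) pairs found in text.

    Lines that do not match the pattern (blank lines, headers) are skipped.
    """
    pairs = []
    for line in text.splitlines():
        match = PAIR_RE.match(line)
        if match:
            pairs.append((float(match.group(1)), float(match.group(2))))
    return pairs

# Hypothetical sample mixing three different separators.
sample = "400.0071 1250.5\n400.0133\t980\n# comment line\n400.0195;0"
print(parse_xy_text(sample))
# [(400.0071, 1250.5), (400.0133, 980.0), (400.0195, 0.0)]
```

The same function would serve for clipboard text, since only the source of the string changes.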
There are two variants of the mass spectrometry file opening menu, one for which all the mass data are read from file and stored in memory and one for which the mass data are read from file in streamed mode, used to compute the TIC chromatogram and discarded. The latter mode is useful when the mass data are so large that they cannot fit in memory. The TIC chromatogram that is computed in streamed mode is then used to access the mass data in the file according to criteria set by the user (retention time range, for example).
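The streamed mode described above can be pictured as follows. This is only an illustrative sketch: `iter_spectra` stands in for a real streaming file reader and is hypothetical, as is the in-memory representation of a spectrum as a list of (m/z, i) pairs.

```python
def iter_spectra(spectra):
    """Stand-in for a streaming file reader: yields one spectrum at a time."""
    for rt, peaks in spectra:
        yield rt, peaks

def streamed_tic(spectrum_stream):
    """Build the TIC chromatogram as a list of (rt, tic) pairs.

    Each spectrum is summed and then dropped, so only one spectrum
    is held in memory at any time.
    """
    chromatogram = []
    for rt, peaks in spectrum_stream:
        tic = sum(intensity for _mz, intensity in peaks)
        chromatogram.append((rt, tic))
    return chromatogram

# Hypothetical data: (retention time in min, [(m/z, i), ...])
data = [
    (0.5, [(400.0, 10.0), (401.0, 5.0)]),
    (1.0, [(400.0, 20.0), (402.0, 1.0)]),
]
print(streamed_tic(iter_spectra(data)))
# [(0.5, 15.0), (1.0, 21.0)]
```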
The graphical interface of mineXpert comprises a number of windows where data and information are displayed. These windows are described below (see Figure 2.1, “General view of the graphical user interface”):
mineXpert main program window: this is an unintrusive window sporting the main menu and a status bar where messages are displayed;
The Loaded mass spectrum files window, that lists all the mass spectrometry data files that are currently loaded in the program;
The TIC chromatogram window[7] where the various TIC chromatograms are displayed for the various mass spectrometry data files that have been loaded. There is, by definition, a single TIC chromatogram per data file currently loaded in the program. However, this window will also display TIC chromatograms that are computed as an integration step from the other windows, like from the Mass spectrum window or from the Drift spectrum window. In this case, the chromatogram is an extracted ion current chromatogram (XIC chromatogram);
The Mass spectrum window, where the various mass spectra are displayed. A given mass spectrum may originate from a TIC chromatogram or from a drift spectrum, or even from a color map. A given originating chromatogram or drift spectrum or color map may be the origin of more than one derived mass spectrum;
The Drift spectrum window, where the various drift spectra are displayed. Drift spectra can originate from the TIC chromatograms, from the mass spectra or from the color map;
The Color map window, that contains a single color map for each loaded mass data file. At the time of this writing, there is no way to produce a color map from any other window;
The m/z integration parameters window, where the parameters governing the mass data integrations to a mass spectrum are set;
The XIC extraction parameters window, where the parameters governing the XIC extractions to an XIC chromatogram are set;
The Console window, where the various messages or analysis data elements are displayed for the user to select, copy and paste in an electronic lab-book;
The menu bar in the main program window displays a number of menu items, reviewed below:
-> Choose the mass spectrum file(s) to load. Note that the “full” descriptor indicates that the user wants to actually load the full data set in memory. This means that she explicitly knows that the system's memory will cope with all the data in the file;
-> Choose the mass spectrum file(s) to load. Note that the “streamed” descriptor indicates that the user does not want to load the full data set in memory. This is typically the case when the data file is so large that its data cannot fit in memory. The program then only “looks” at the data in the file and crafts, piecemeal, the TIC chromatogram and the color map. In this context, any other data integration will be performed by looking into the same mass data file since no data are available in memory;
-> Creates a mass spectrum from a textual representation of (m/z,i) pairs in the same format as described above for the txt,asc file format;
-> Define the analysis preferences. The analysis preferences govern how the data about scrutinized
-> Clears all the plots currently displayed in the program. The plot items in the Loaded mass spectrum files window are all removed. Note that this releases all the memory that was used by the data. This menu is equivalent to “closing all files”;
The menus are self-explanatory, as they explicitly name the window that is to be shown. One menu item records on disk the position and size of all the windows, so that, upon reopening the program, all the windows are restored to the recorded position and size;
This menu's items show help about the program itself and also about the Qt libraries that were used to build it. This information is essential in case the user wants to make a bug report.
This section succinctly describes the main data windows of mineXpert. Each window is described in greater detail along with the corresponding features of the program.
Each time a new mass spectrum file is loaded, its corresponding TIC chromatogram is computed and then displayed in a new plot widget in the TIC chromatogram window (Figure 2.2, “The total ion current (TIC) chromatogram window”). Each new TIC chromatogram plot generated as a result of the loading of a mass spectrometry data file is plotted using a new color. That color encodes the filiation of the whole set of plots that are generated starting from that initial TIC chromatogram plot. For example, a red TIC chromatogram plot that serves as the starting point for a mass spectrum integration will trigger the creation of a mass spectrum plot widget that will have a red graph in it. The same is true for the color map widget, which has its axis and tick labels of the same color as that of the TIC chromatogram plot.
The attention of the reader is drawn to the specific situation whereby the user loads mass spectrometric data from a non-profile acquisition data file or from clipboard data. For example, when a mass spectrum is opened from a txt,asc,xy text-based format file where the data correspond to a single spectrum, not a sequence of spectra, the TIC chromatogram really has a single (rt,i) pair denoting the TIC intensity at the single retention time of that unique spectrum. The TIC chromatogram window thus artificially creates and displays a TIC chromatogram that is a simple line, as shown in Figure 2.3, “Loading a single-spectrum data file or a mass spectrum from clipboard data”. Because there is a single point, the user has nothing else to do than “integrate” these data to a mass spectrum. This is the reason why this integration is actually performed automatically and the spectrum is thus shown in the Mass spectrum window.
The mass spectrum window contains all the plot widgets that display mass spectra that originated in other windows. For example, the user might select a region in a TIC chromatogram and then ask that a mass spectrum integration be computed. In this case, the resulting mass spectrum is displayed in a new plot widget that is located in the mass spectrum window (Figure 2.4, “The mass spectrum window”).
As described for the mass spectrum window, the drift spectrum window contains all the plot widgets that display drift spectra (Figure 2.5, “The drift spectrum window”).
The color map window displays a color map view of the drift data in the form of m/z vs drift time (dt). The intensity of the m/z values is coded in colour. The axes can be switched, such that either the m/z vs dt or the dt vs m/z representation can be obtained (see toolbar button on Figure 2.6, “The color map window”).
The TIC chromatogram, mass spectrum and drift spectrum windows are all structured in a similar way. The window is divided vertically in two compartments. The bottom compartment hosts all the plot widgets stacked vertically. The top compartment hosts a single plot widget where all the graphs that are displayed individually in the lower compartment are shown superimposed.
The plot widget that is packed in the top compartment of the window is called the “multi-graph” plot widget because it can hold more than one graph. The plot widget(s) that is(are) packed in the bottom compartment of the windows is(are) called “single-graph” plot widget(s) because each plot contains only one graph.
The two vertical compartments of the window are resizable by dragging the sliding horizontal bar that separates them. It is possible to totally occlude one of the compartments by dragging that sliding bar all the way up (or down) to the window side.
The behaviour described above does not apply to the Color map window that has no upper compartment with all the color maps superimposed.
The TIC chromatogram, color map, mass spectrum and drift spectrum windows all contain plot widgets (or color map widgets) that share a general working scheme as to how the data can be visualized. The main visualization operations are succinctly described below. The following convention will be used to describe the mouse buttons:
: left mouse button;
: middle mouse button;
: right mouse button.
The different plot or colormap graph visualization methods are detailed below:
Zooming in and zooming out:
Zoom in: -click-drag to draw a selection rectangle. When the
mouse button is released, the new plot view contains the data
contained in the selection rectangle;
Zoom in: -click-drag along the X-axis over the region to zoom
and release the mouse button. The new zoomed view does not
automatically scale to full scale in the Y-axis direction. To ensure
that the new view automatically scales on the Y-axis, press
Shift while releasing the mouse button. When the zoom in
operation is performed using Shift in the multi-graph widget,
the Y-axis is set to full scale with respect to the point having the
maximum intensity of all the graphs being shown at that moment;
As seen in the figure above, the region defined by the -click-dragging
operation is delimited by green and red markers, respectively at the start
and at the end of the selection. The distance between the start and end
points is updated as the mouse moves.
Zoom in/out: -click-drag on the X- or Y-axis to interactively
zoom in or out along the selected axis. In this mode, the zoom
operates by contracting/expanding the data in such a manner that the
left/bottom part of the graph (the origin of the graph) is anchored
and does not move. When the drag occurs towards larger values on the
clicked axis, the view is zoomed in along that axis. Conversely, it is
possible to zoom out by dragging the mouse towards lower axis values.
When the number of points in the plot is so large that the zoom
operation is sluggish, pressing Ctrl makes the zoom operation
smoother;
Zoom in/out: The -wheel-rotation can be used to zoom in or
out the whole plot on both the X- and Y-axis simultaneously. Note that
the position of the mouse cursor when the wheel is rolled defines the
new view of the plot. With a little practice, this zooming in/out
mode becomes very powerful.
Zoom out: To reset the zoom along one axis, -double-click
that axis. In this case, only the clicked axis will be full-scale,
the other axis remains unchanged. To reset the zoom such that the full
scale is calculated on the data set displayed after the zoom, maintain
the Shift key pressed when double-clicking. To reset the zoom
on both axes in one go,
-double-click one of the axes maintaining
the Ctrl key pressed;
Panning:
-click-drag on one of the axes to pan the plot view along
that axis;
History:
Each time a new zoomed in/out view of the plot is triggered, a history element is stored in the plot widget. To replay the various zoom in/out steps backwards, from the last-but-one to the first, hit the Backspace key. The exceptions to this mechanism are panning of the plot view and zooming with the mouse wheel.
The tool bar located at the top of the windows described above contains two buttons that lock the x axis (the button icon has the horizontal red line) and/or the y axis (red line is vertical) range across all the graphs displayed in the window. This is of great use when the user wants to compare a number of graphs that have been obtained on comparable samples. The movements and zooming-in or zooming-out operations in one graph are then synchronized to all the other graphs.
The third button performs a transpose operation. When the color map is initially created, the horizontal axis (the keys of the map) is the drift time axis and the vertical axis (the values of the map) is the m/z axis. The transpose operation switches the representation of the map such that the axes are inverted.
Analyzing mass spectrometric data (with or without drift data) usually involves performing various data integrations in sequence. We saw earlier that the first data that are plotted upon loading a mass spectrometry data file are the TIC chromatograms along with (if applicable) the m/z vs dt color maps. These two graphed data sets are the starting points for the mass spectrometric data mining, that may involve the following integration operations:
TIC chromatogram to mass spectrum This
kind of operation is triggered upon -click-dragging the mouse
over the region of interest and maintaining the S
key pressed. mineXpert integrates all the spectra that have been
acquired at all the retention times between the start and the end of
the selected region. A new mass spectrum is then plotted in a new
plot widget in the mass spectrum window;
As seen on the figure above, the region defined by the
-click-dragging operation is delimited by arrows, a green marker
at the start and a red marker at the end.
TIC chromatogram to drift spectrum This kind of operation is similar to the one described above, except that the D key must be pressed. As above, a new drift spectrum is appended to the drift spectrum window.
Color map to mass spectrum This operation
involves -selecting a rectangular region of interest on the color
map while maintaining the S key pressed. A new mass spectrum
is then plotted in a new plot widget in the mass spectrum window. Note
that, due to a bug in the plotting library, the rectangle is not
currently drawn on top of the color map.
Color map to drift spectrum Same as above, but with the D key pressed. As above, a new drift spectrum is appended to the drift spectrum window. The same remark about the selection rectangle applies.
The same mechanics is at work in the other plot widget windows. For example, to trigger the integration of a mass spectrum starting from a drift spectrum, simply drag the mouse over the drift spectrum and maintain the S key pressed.
Rule of thumb: when a maSs spectrum is to be generated, use the S key, when a Drift spectrum is to be generated, use the D key and, finally, when a Retention time TIC chromatogram is to be generated, use the R key.
One of the most interesting features for detailed mass data
mining is the integration to a TIC intensity ( -click-drag with
I pressed). That integration can be triggered from
any of the data windows (any single-graph plot
widget in any of these windows, that is). No plot is created; the
data are simply displayed in the status bar of the window and in the
Console window.
It is important that the mouse is maintained still right after having triggered the intensity calculation because otherwise the status bar message displaying the result vanishes.
This integration will be discussed later.
Figure 2.9, “The tool bar and its buttons” shows the various buttons of a plot window. The button with the `?' character will show a tooltip describing the various keyboard/mouse combinations to use to trigger the various data combinations described earlier.
The “filiation” of the plots is maintained using identifying colors. However, color is not enough to unambiguously identify the “filiation” of any given plot. Indeed, the same TIC chromatogram or Color map plot can be used multiple times to perform integrations. The newly created plots will have the same color as the originating plot, but it will not be possible to distinguish between all the “child” plots. This is why the plots maintain a “history” of the way they have derived from the initial TIC chromatogram/Color map plot. This history is shown in a small widget that shows up when the O key is pressed while the mouse cursor hovers over the widget at hand. One example of plot “history” is shown in Figure 2.10, “The plot filiation history widget”.
This figure shows the filiation history of plots. As shown, the first integration was performed when loading the mass data file in order to produce the TIC chromatogram. The first history item is thus [File -> RT], indicating that the plot widget graph originates from loading a file and computing the TIC chromatogram (RT stands for retention time).
The computed TIC chromatogram served as a starting point for an integration to a mass spectrum. The second history item shows just this, in addition to the first item that is still shown: [RT -> MZ]. The range shown indicates that the integration was performed over the specified range of RT values.
The final integration step was to compute a drift spectrum starting from the mass spectrum and that is denoted using the [MZ -> DT] expression. The concerned range is also shown.
As can be seen in the various filiation history items of Figure 2.10, “The plot filiation history widget”, there is always a History - innermost ranges section that lists the three ranges for RT, MZ and DT. What is that “innermost range” concept? The idea is that, when chaining integrations, each new plot reflects an ever smaller subset of the initial dataset that was loaded from file. The RT, MZ, DT ranges may thus be reduced progressively. For each of these RT, MZ, DT properties, the innermost range is just that: the smallest range that is currently plotted in the plot widget at hand. An example is worth a thousand words:
1. An [RT -> MZ] integration starting from a TIC chromatogram;
2. The m/z range obtained is [500-2500]. From that mass spectrum, the user integrates to a drift spectrum [MZ -> DT];
3. Then, from the drift spectrum, a peak seems interesting and the user back-integrates to a mass spectrum [DT -> MZ];
4. In the mass spectrum, a mass peak is of interest and the user wants to see at which retention times the m/z value elutes: she does an integration [MZ -> RT].
The innermost MZ range in the XIC chromatogram obtained at step 4 will be the last m/z range selected for the mass peak of interest, not the range at step 2.
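The bookkeeping behind the four-step example above can be sketched as follows. This is an illustration only, not mineXpert's data structure: the function name, the tuple layout and all the range values other than [500-2500] are hypothetical, and the sketch assumes that the innermost range of a dimension is the intersection of all the ranges selected so far on that dimension.

```python
def innermost_ranges(history):
    """Return the narrowest range seen so far for each of RT, MZ and DT.

    history is a list of (source, target, (start, end)) steps, where the
    range applies to the source dimension of the integration.
    """
    innermost = {}
    for source, _target, (start, end) in history:
        if source not in innermost:
            innermost[source] = (start, end)
        else:
            low, high = innermost[source]
            # Intersect with the range already recorded for that dimension.
            innermost[source] = (max(low, start), min(high, end))
    return innermost

history = [
    ("RT", "MZ", (0.0, 15.0)),       # first integration
    ("MZ", "DT", (500.0, 2500.0)),   # second integration
    ("DT", "MZ", (10.0, 12.0)),      # third integration
    ("MZ", "RT", (1220.0, 1221.0)),  # fourth integration
]
print(innermost_ranges(history))
# {'RT': (0.0, 15.0), 'MZ': (1220.0, 1221.0), 'DT': (10.0, 12.0)}
```

The innermost MZ range is the narrow one selected around the mass peak in the last integration, not the wider [500-2500] range of the second one.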
Depending on the integrations that are triggered in the various data plot/map widgets, the computations vary significantly. This section will describe the general computation algorithms in such a manner that the mineXpert user can grasp what is actually going on in the guts of the software. The integrations to a mass spectrum are particularly sensitive to some parameters that will be described in detail in the following section.
This integration occurs when the user -selects a range in a given
plot while pressing the S key. Integrations to a mass
spectrum can be elicited from a TIC chromatogram plot, a color map plot or a
drift spectrum plot. In all these cases, the integration computation (that
is, a mass spectral combination) needs to be aware of the kind of data at
hand.
In order to clarify what integration means in the context of the creation of a mass spectrum, that is, the summative integration (also known as “combination”) of any number of mass spectra, the following describes a combination in detail.
In this example, the user has loaded a mass data file obtained after an acquisition of mass data in profile mode. mineXpert calculates the TIC chromatogram right after having loaded the mass data. The user performs an integration for a given retention time range in the TIC chromatogram. If we consider an integration range [0–15] min, this is what would occur in the guts of mineXpert. In this example, we omit any step corresponding to any binning.
First of all, create a new mass spectrum (let's call it newMS, also known as the “combination spectrum”, that is, the result spectrum);
Extract from the mass spectrometry data all the spectra that have their internal rt value (retention time) contained in the [0–15] min interval. The list of extracted mass spectra (let's call that list msL) is then processed as follows:
Iterate in msL and for each iterated mass spectrum (iterMS):
Iterate in all the (m/z,i) pairs of iterMS and for each one check if the m/z value was already found in any of the previous mass spectra, that is, if a (m/z,i) pair in newMS has that m/z value. If:
the m/z was not found, copy the (m/z,i) pair in newMS;
else if the m/z value was already encountered in previously iterated mass spectra, increment the intensity of the corresponding (m/z,i) pair of newMS by the value of the iterated (m/z,i) pair. This is where the summative combination of mass spectra is at work.
At the end of this process, newMS will correspond to the summation of all the spectra contained in the msL list. The newMS mass spectrum is then plotted in the mass spectrum window as a new plot. The color of the newMS plot is the same as the color of the initial TIC chromatogram plot.
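The combination steps described above, without any binning, can be sketched in a few lines of Python. The names mirror those used in the text (newMS, msL, iterMS); the function name and the representation of the data as (rt, list of (m/z, i) pairs) tuples are assumptions made for the illustration.

```python
def combine_without_binning(spectra, rt_start, rt_end):
    """Sum all spectra whose retention time lies in [rt_start, rt_end].

    spectra is a list of (rt, [(mz, intensity), ...]) tuples. Intensities
    are summed only when the m/z values are strictly identical.
    """
    new_ms = {}  # m/z -> summed intensity (the "combination spectrum")
    ms_list = [peaks for rt, peaks in spectra if rt_start <= rt <= rt_end]
    for iter_ms in ms_list:
        for mz, intensity in iter_ms:
            # Copy the pair if the m/z is new, else increment its intensity.
            new_ms[mz] = new_ms.get(mz, 0.0) + intensity
    return sorted(new_ms.items())

# Hypothetical data; the spectrum acquired at 20 min is out of range.
spectra = [
    (1.0, [(500.0, 10.0), (500.5, 2.0)]),
    (2.0, [(500.0, 5.0), (501.0, 1.0)]),
    (20.0, [(500.0, 100.0)]),
]
print(combine_without_binning(spectra, 0.0, 15.0))
# [(500.0, 15.0), (500.5, 2.0), (501.0, 1.0)]
```

Because two intensities are summed only when their m/z values are exactly equal, m/z values that differ by even a tiny amount end up as separate points in the combination spectrum.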
The process described above can only work in very limited circumstances, with data files generated with particular instruments. In general, this process does not lead to a usable mass spectrum, as described in Figure 2.11, “Unusable combination spectrum without binning”. In this combination mass spectrum, computed from Lumos Orbitrap-originating data, the plot shows what should have been a high resolution monoisotopic peak (the m/z delta of the whole signal is 0.009). As can be seen, the signal in this mass spectrum is totally useless and the integration to a mass spectrum requires binning to overcome the presence of so many peaks in that 0.009 m/z interval.
mineXpert provides a number of ways to configure mass spectral combinations such that the obtained mass spectrum is usable. The m/z integration parameters that might be set are described in the following sections.
Loading data from mass data files in mzML format does not guarantee that the data will be of the same kind when they originate from different mass spectrometers. For example, data from Orbitrap mass spectrometers have the following characteristics:
Not all spectra start at the same m/z value;
Not all spectra have the same number of data points (they do not have the same size);
A large number of data points might have 0 values (intensity at a given m/z value is 0);
The m/z delta between two consecutive m/z values is not constant, and this is the major difficulty for data integration to a mass spectrum.
This is the output of the statistical analysis of the data loaded from a Lumos Orbitrap-originating file:
Spectral data set statistics:
Total number of spectra: 6203
Average of spectrum size: 391.311946
StdDev of spectrum size: 168.062934
Mininum m/z value: 400.007111
Average of first m/z value: 401.448935
StdDev of first m/z value: 1.590049
Maximum m/z value: 1999.928589
Average of last m/z value: 1901.852315
StdDev of last m/z value: 45.864131
Minimum m/z shift: -0.344452
Maximum m/z shift: 0.000000
Average of m/z shift: 1.097372
StdDev of m/z shift: 1.590049
Smallest Delta of m/z (step): 0.006195
Average of smallest Delta of m/z (step): 0.023757
StdDev of smallest Delta of m/z (step): 0.013179
Greatest Delta of m/z (step): 405.356934
Average of greatest Delta of m/z (step): 163.112057
StdDev of greatest Delta of m/z (step): 75.947334
As mentioned earlier, the most interesting bit of information is in the line reproduced below:
Smallest Delta of m/z (step): 0.006195
That 0.0062 value somehow gives an indication of the “definition” of the spectrum, that is, the smallest distance possible between two consecutive points in the m/z axis.
In general, the fact that the spectra of an acquisition do not all have the same m/z vector as the m/z axis is a great difficulty for mass spectral integration because it requires setting up binning prior to performing the mass spectral combination. That binning is nothing else than crafting a m/z value vector able to receive the intensities of all the m/z data points in the spectra to be combined. These concepts are developed in the following paragraphs.
At the end of the data file loading, mineXpert performs a rudimentary statistical analysis of the data. The main datum of interest is the smallest m/z step that is observed in the whole set of mass data loaded from disk (the mass spectrum list, that can hold mass spectra in the thousands). For each mass spectrum in the list, the smallest m/z delta between any two consecutive data points is recorded. Then, the smallest ever m/z delta value is sought amidst all the recorded values. Intuitively, that smallest m/z delta value provides an idea of the resolution power of the instrument that generated the mass spectra. Interestingly, this is not the proper value to configure binning. The best value is the median value of the smallest m/z delta values encountered over all the mass spectra of the data file. It is the value that is suggested by default to arbitrarily construct the bins during an integration to a mass spectrum, as described in Figure 2.12, “The m/z integration parameters window” (Arbitrary binning value with bin size unit MZ).
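The two values discussed above, the smallest m/z step overall and the median of the per-spectrum smallest steps, can be computed as in the following sketch. The function names are hypothetical and each spectrum is reduced here to its sorted list of m/z values.

```python
from statistics import median

def smallest_step(mz_values):
    """Smallest m/z delta between two consecutive points of one spectrum."""
    return min(b - a for a, b in zip(mz_values, mz_values[1:]))

def step_statistics(spectra_mz):
    """Return (smallest step overall, median of per-spectrum smallest steps)."""
    steps = [smallest_step(mz) for mz in spectra_mz if len(mz) > 1]
    return min(steps), median(steps)

# Hypothetical m/z vectors of three spectra.
spectra_mz = [
    [400.0, 400.5, 401.5],   # smallest step 0.5
    [400.0, 400.25, 401.0],  # smallest step 0.25
    [400.0, 401.0, 402.0],   # smallest step 1.0
]
print(step_statistics(spectra_mz))
# (0.25, 0.5)
```

In this toy data set the default bin size would be the median, 0.5, rather than the overall minimum, 0.25.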
The bin size units, when using Arbitrary binning, might be MZ, PPM or RES. In the two latter cases, the bin size changes along the m/z axis: it increases with increasing m/z values. For example, if the bin size unit is PPM and the bin size is 10, then at m/z 300, the bin size would be m/z 0.003 (300 × 10 / 1,000,000), while at m/z 2000, the bin size would be m/z 0.02. If the bin size unit is RES and the bin size is 10000, then at m/z 300, the bin size would be m/z 0.03 (300 / 10000), while at m/z 2000, the bin size would be m/z 0.2.
Once Arbitrary binning is selected in the m/z integration parameters window and the bin size and bin size unit have been set, the program creates the bins in the combination mass spectrum according to these settings. The first bin and the last bin are simply the smallest and greatest m/z values found in all the spectra to be combined. The program fills the gap between these two values in steps matching the bin size, with or without PPM/RES weighting: if the bin size unit is MZ, the step is constant; if the bin size unit is either PPM or RES, the step is recalculated along the m/z axis, as shown in the examples above. Once the bins have been set up in the combination spectrum, the actual combination of all the mass spectra can take place.
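The construction of arbitrary bins can be sketched as follows. This is a hedged illustration, not the program's code: the function names are hypothetical, the PPM width is taken as m/z × size / 1,000,000, and the RES width is assumed to follow the usual definition of resolution (width = m/z ÷ resolution).

```python
def bin_width(mz, bin_size, unit):
    """Width of the bin starting at mz, for the MZ, PPM or RES unit."""
    if unit == "MZ":
        return bin_size
    if unit == "PPM":
        return mz * bin_size / 1e6
    if unit == "RES":
        # Assumption: resolution R = m / delta_m, hence delta_m = m / R.
        return mz / bin_size
    raise ValueError(f"unknown bin size unit: {unit}")

def make_bins(mz_min, mz_max, bin_size, unit="MZ"):
    """Return the list of bin m/z values spanning [mz_min, mz_max]."""
    bins = [mz_min]
    while bins[-1] < mz_max:
        step = bin_width(bins[-1], bin_size, unit)
        # Clamp so that the last bin is exactly the greatest m/z value.
        bins.append(min(bins[-1] + step, mz_max))
    return bins

print(make_bins(400.0, 402.0, 0.5))
# [400.0, 400.5, 401.0, 401.5, 402.0]
print(len(make_bins(400.0, 2000.0, 10, "PPM")))  # many narrow bins
```

With the PPM and RES units the step is recomputed at every bin, so the bins widen progressively toward high m/z values.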
This section provides some examples of how the integration parameters might impact the mass spectrum resulting from combination of mass spectra. Also, this section details the general guidelines for ensuring the best combination calculation.
When Data-based binning is recommended. Data-based binning means that the bins in the combination spectrum are nothing but the m/z values of the first spectrum of the mass spectral set to be combined. This is the simplest integration mechanism and is recommended when the mass data are perfectly coherent, that is, when all the mass spectra are rooted in the (roughly) same m/z value and the vector of m/z values along the m/z axis is reproduced over all the mass spectra of the combination set. This situation is exemplified in Figure 2.13, “Bruker microQTof acquisition of a protein mass data”.
In this example, not all the m/z integration parameters produce acceptable results for the combination mass spectrum. The integration with no binning produces an unusable spectrum. Note that the best results are obtained with Data-based binning. This is because the m/z data from the Bruker microQTof instrument are very reproducible from one spectrum to the next. Setting the bins exactly as they are in the first spectrum of the mass spectrum list to be combined is thus efficient.
These combination spectra were obtained by performing a mass spectral integration of mass data acquired for a protein solution in a microQTof Bruker instrument. The mass acquisition settings were characteristic of a protein analysis in the 25–35 kDa range. The top spectrum was obtained by performing an integration with no binning at all. The spectrum is useless.
The statistical analysis of the mass data calculated after loading of the mass data had shown that the median smallest m/z delta value was 0.017. The middle spectrum was obtained after an integration with an arbitrary binning of size 0.017 and with bin size unit MZ (constant bin size throughout the m/z vector). The result is much better than the one obtained earlier. Some glitches are still visible, but the data are eminently usable. The bottom spectrum was obtained by performing a combination with Data-based binning. This result is the best.
The setting up of bins ultimately consists in creating a mass spectrum out of preexisting data (the first mass spectrum of the set in the case of Data-based binning) or out of arbitrary values (the smallest and greatest m/z values of the spectral set, the bin size and finally the bin size unit). In the latter case, the data points making up the newly created mass spectrum have their m/z value calculated and their intensity set to 0. Because the m/z value is calculated starting from an arbitrary bin size value, it is possible that not a single data point in the whole set of mass spectra has a m/z value matching that bin m/z value. In that case, the m/z data point still has a 0-intensity value at the end of the mass spectral combination. This is illustrated in Figure 2.14, “Removing 0-intensity data points”. When the 0-intensity data points are not removed (upper spectrum), the signal is deteriorated by these inverted spikes. Removal of the 0-intensity data points cleans the trace perfectly.
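The following sketch shows how empty bins arise during a binned combination and how they can be removed afterwards. The rule used to assign a data point to a bin (the last bin whose m/z value is not greater than that of the point) is an assumption made for the illustration, as are the function names; the text does not specify how mineXpert matches points to bins.

```python
from bisect import bisect_right

def combine_into_bins(bins, spectra):
    """Sum the intensities of all spectra into pre-computed bins.

    bins is a sorted list of bin m/z values; each data point is added to
    the last bin whose m/z value is not greater than the point's m/z.
    """
    intensities = [0.0] * len(bins)  # every bin starts at 0 intensity
    for peaks in spectra:
        for mz, intensity in peaks:
            index = max(bisect_right(bins, mz) - 1, 0)
            intensities[index] += intensity
    return list(zip(bins, intensities))

def remove_zero_intensity(spectrum):
    """Drop the (m/z, i) points whose intensity is exactly 0."""
    return [(mz, i) for mz, i in spectrum if i != 0.0]

# Hypothetical bins and spectra.
bins = [500.0, 500.5, 501.0, 501.5]
spectra = [
    [(500.1, 3.0), (501.2, 1.0)],
    [(500.2, 2.0), (501.0, 4.0)],
]
combined = combine_into_bins(bins, spectra)
print(combined)
# [(500.0, 5.0), (500.5, 0.0), (501.0, 5.0), (501.5, 0.0)]
print(remove_zero_intensity(combined))
# [(500.0, 5.0), (501.0, 5.0)]
```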
The Savitzky-Golay filtering method is widely known for its effectiveness in removing noise from mass spectral data. It is possible to apply that filter at the end of a mass spectral combination. The m/z integration parameters window allows setting the Savitzky-Golay parameters:
nL: specifies the number of data points to the left of the point being filtered;
nR: specifies the number of data points to the right of the point being filtered. The total number of points in the window that is considered for the regression is thus nL + nR + 1;
m: specifies the order of the polynomial to use in the regression analysis leading to the Savitzky-Golay coefficients (typically between 2 and 6);
lD: specifies the order of the derivative to extract from the Savitzky-Golay smoothing algorithm (for regular smoothing, use 0).
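To make the role of the four parameters concrete, here is a self-contained sketch of the classical least-squares derivation of the Savitzky-Golay coefficients. It is not mineXpert's implementation: the function names are hypothetical, the sketch assumes m ≤ nL + nR, and it leaves the edge points untouched, which is only one of several possible edge treatments.

```python
from math import factorial

def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination."""
    n = len(b)
    rows = [list(row) + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(rows[r][col]))
        rows[col], rows[pivot] = rows[pivot], rows[col]
        for r in range(n):
            if r != col:
                f = rows[r][col] / rows[col][col]
                rows[r] = [x - f * y for x, y in zip(rows[r], rows[col])]
    return [rows[i][n] / rows[i][i] for i in range(n)]

def savgol_coefficients(nL, nR, m, lD=0):
    """Coefficients of the filter for the window offsets -nL .. +nR."""
    offsets = range(-nL, nR + 1)
    # Normal equations of the polynomial fit: ata[i][j] = sum of k^(i+j).
    ata = [[sum(k ** (i + j) for k in offsets) for j in range(m + 1)]
           for i in range(m + 1)]
    unit = [1.0 if i == lD else 0.0 for i in range(m + 1)]
    c = solve(ata, unit)
    return [factorial(lD) * sum(c[j] * k ** j for j in range(m + 1))
            for k in offsets]

def savgol_smooth(values, nL, nR, m, lD=0):
    """Apply the filter; edge points are copied unchanged (sensible for lD=0)."""
    coeffs = savgol_coefficients(nL, nR, m, lD)
    out = list(values)
    for i in range(nL, len(values) - nR):
        window = values[i - nL:i + nR + 1]
        out[i] = sum(c * v for c, v in zip(coeffs, window))
    return out

print([round(c, 4) for c in savgol_coefficients(2, 2, 2)])
# [-0.0857, 0.3429, 0.4857, 0.3429, -0.0857]
```

With nL = nR = 2, m = 2 and lD = 0, the window spans nL + nR + 1 = 5 points and the coefficients are the well-known (-3, 12, 17, 12, -3)/35 set.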
This integration occurs when the user -selects a range in a
given plot while pressing the D key. In the detailed
example below, the integration occurs for a given retention time range
in the TIC chromatogram (integration range
[0–15] min):
First of all, create a <dt,tic> map to store all the drift time values encountered below, along with the cumulated total ion current intensity value of the spectra acquired at the corresponding dt drift time;
Extract from the mass spectrometry data all the spectra that have their internal rt value (retention time) contained in the [0–15] min interval. The list of extracted mass spectra (msL) is then processed in such a manner that each mass spectrum (ms) it contains is iterated over:
Get the dt at which the ms was acquired;
Calculate the total ion current (tic) for the ms;
In the <dt,tic> map, check if the dt value was already found. If:
The dt was not already found, create one (dt,tic) pair and insert it in the map;
Else if the dt was already encountered in previously iterated mass spectra, increment the tic value of the corresponding (dt,tic) pair in the map by the tic value calculated above for ms.
At the end of this process, the <dt,tic> map will correspond to the drift spectrum. That spectrum is then plotted in the drift spectrum window as a new plot with the same color as that of the initial TIC chromatogram plot.
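The steps above can be sketched in a few lines of Python. The data layout (one dictionary per mass spectrum, with hypothetical rt, dt and peaks keys) and the function name are assumptions made for the illustration; they do not reflect the internal data structures of mineXpert.

```python
def integrate_to_drift_spectrum(spectra, rt_start, rt_end):
    """Integrate a retention time range to a drift spectrum."""
    dt_tic = {}  # the <dt,tic> map
    for ms in spectra:
        # Keep only the spectra acquired in the [rt_start, rt_end] interval.
        if not rt_start <= ms["rt"] <= rt_end:
            continue
        tic = sum(intensity for _mz, intensity in ms["peaks"])
        # Create the (dt,tic) pair or increment the existing one.
        dt_tic[ms["dt"]] = dt_tic.get(ms["dt"], 0.0) + tic
    return sorted(dt_tic.items())


spectra = [
    {"rt": 2.0, "dt": 1.5, "peaks": [(500.0, 10.0), (600.0, 5.0)]},
    {"rt": 9.0, "dt": 1.5, "peaks": [(500.0, 3.0)]},
    {"rt": 12.0, "dt": 2.5, "peaks": [(700.0, 8.0)]},
    {"rt": 20.0, "dt": 2.5, "peaks": [(700.0, 100.0)]},  # outside [0-15] min
]
drift_spectrum = integrate_to_drift_spectrum(spectra, 0.0, 15.0)
```

The last spectrum is ignored because its retention time lies outside the integration range, so the drift spectrum holds two points, one per drift time value.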
This integration occurs when the user selects a range in the TIC
chromatogram plot with the mouse while pressing the I key. The
integration is performed by looking into the mass data for (m/z,i) pairs
that match the current integration history of the current data plot and
by summing all their intensities to yield a final TIC intensity value. This value
is printed in the status bar of the window. Be aware that if you move the
cursor right after having performed the computation, the message in the
status bar of the window will be erased. In this case, that same value is
printed in the console window in the same color as the color of the plot
from which it was computed.
Worthy of note is the fact that this kind of integration can be performed in the exact same way in the various data plots (TIC chromatogram, mass spectrum, drift spectrum, mz=f(dt) color map).
The user, in the process of mining the data, will inevitably chain integrations to pinpoint a specific feature of interest. For example, let's say that the user is mining ion mobility mass spectrometry data. After having loaded the data file, the colormap is computed and displayed (see Figure 2.15, “Example of chained integrations”).
There starts the exploration. The user sees that there are a number of
species having discrete drift times at the m/z ratio around 1220 (lower
region of the colormap). She thus integrates that horizontal lower
region of the colormap to a single drift spectrum (click-drag with D
pressed). The obtained drift spectrum is shown at the
right hand side of Figure 2.16, “One example of chained integrations”.
Because there are multiple drift peaks in the drift spectrum, the user
performs individual mass data integrations to a mass spectrum for each
drift peak (click-drag with S pressed). The mass
spectra obtained are all shown in the middle window of the same figure.
Most interestingly, the various drift regions are integrated to almost
identical m/z values in their respective mass spectrum. In order to
know when the various molecular species eluted in the chromatogram, the
user performs for each mass spectrum an integration to a XIC chromatogram
(click-drag with R pressed). It is then visible
that each molecular species was detected at
discrete retention times (this was clearly not a true
chromatography but instead an infusion with instrument parameters changed
during the acquisition).
The dataset used in Section 2.9, “Chained Integrations” is kind courtesy of Dr. Valérie Gabelica and corresponds to a work entitled Optimizing Native Ion Mobility Q-TOF in Helium and Nitrogen for Very Fragile Noncovalent Structures published in JASMS with DOI: 10.1007/s13361-018-2029-4.
When analysing a mass spectrum, two major deconvolutions are performed to get back to the Mr mass of the analyte while reading m/z values: the charge-based deconvolution and the monoisotopic cluster-based deconvolution. In the following sections, both deconvolutions are described.
In this kind of deconvolution, at the present time, the software assumes that the ionization agent is the proton and that the ionization is positive.
The deconvolution is based on the determination of the distance
between consecutive (or not) peaks of a given charge state envelope.
When the user click-drags the cursor from one peak to another, the
program tries to calculate if the distance between two peaks matches a
charge difference. If so, it computes the molecular
(Mr) mass of the analyte whose mass peak is
located under the cursor. Figure 2.17, “Charge-based mass deconvolution (consecutive peaks)”
shows that precise state for two consecutive peaks
of a charge state envelope.
Note that the charge calculation almost never produces an integer value with no fractional part (say, charge z=15.0) because it is almost impossible to drag the mouse cursor over the exact m/z range that would lead to such an integral charge value. Almost always, the charge that is calculated looks like 14.995 or 15.001, for example. Why is it impossible to drag the mouse cursor over exactly the interval that would produce an integral charge value? Simply because the mouse moves at discrete positions on the screen that might be more or less far apart, depending on the mouse capabilities. The zoom state over the two peaks also plays a role. It is advised to zoom in as much as possible over the peaks at hand so as to minimize the difficulties above. It may happen, however, that even zoomed-in peaks are not sufficiently distant to allow a charge calculation (this is the case in the upper spectrum of Figure 2.17, “Charge-based mass deconvolution (consecutive peaks)”, where the computation could not be performed). In this case, reduce the stringency over the fractional part that is allowed in the charge. By default, the stringency is set at 0.99, that is, any calculated value that has a fractional part either greater than or equal to 0.99 or less than or equal to 0.01 leads to a successful rounding up or down to the nearest integer value. When the fractional part falls inside the ]0.01-0.99[ interval, no charge calculation is performed and thus no deconvolution is performed. When the stringency is too high, reducing it will allow the deconvolution to be carried out (see bottom spectrum of Figure 2.17, “Charge-based mass deconvolution (consecutive peaks)”).
The status bar of the window documents the current inter-peak
distance measurement operation that is performed by click-dragging
the cursor starting at the left peak towards the right peak. The start
peak is marked with a green marker and the end peak is marked with a red
marker. Start and end positions are documented in the form
(m/z start, i) -> (m/z end, i). Then, the m/z delta,
that is, the distance between both positions is provided. When the end
position matches a theoretically expected distance corresponding to a
charge difference of 1, then the charge z of the
peak under the cursor is provided and the molecular mass
(Mr) is provided for the analyte whose peak is
under the cursor.
It might happen that two consecutive peaks of
the charge state envelope are not well shaped enough to point and
click precisely in the center of the peaks. In that case, the software
allows indicating the number of intervals that run between the two
peaks connected by the click-drag operation. This is illustrated in
Figure 2.18, “Charge-based mass deconvolution (non-consecutive peaks)”.
The user knew that she had to measure the distance between two peaks
that were separated by two intervals. She thus incremented the interval
value in the status bar to 2 and performed the
measurement. The Mr value that is displayed is different from the
previous one because, without enlarging the window, it is more difficult
to click right at the center of the Gaussian shape of each peak.
Theoretically, the Mr values should be identical, and actually are when
the measurements are performed cleanly in widely-laid mass
spectra.
Note that
the click-dragging direction (left -> right or right -> left)
has an impact on the value of the charge (z) that
is obtained, since that charge value is relative to the peak
under the cursor at the moment of the
deconvolution. Conversely, the mouse-dragging direction has no effect
on the Mr (molecular mass) of the analyte
obtained as a result of the deconvolution process.
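The arithmetic behind the charge-based deconvolution can be sketched as follows, assuming protonation in positive mode as stated above. For a peak at mz_low carrying z charges and a peak at mz_high carrying z - n charges (n being the number of intervals), equating the two expressions of Mr gives z = n (mz_high - p) / (mz_high - mz_low), where p is the mass of the proton. The function names are hypothetical and the sketch returns the charge of the low m/z peak only; it is not the mineXpert implementation.

```python
import math

PROTON_MASS = 1.007276466  # mass of the proton, in Da


def charge_deconvolution(mz_low, mz_high, intervals=1, stringency=0.99):
    """Return (z, Mr) for the peak at mz_low, or None if z is too far from an integer."""
    z_raw = intervals * (mz_high - PROTON_MASS) / (mz_high - mz_low)
    fraction = z_raw - math.floor(z_raw)
    # Accept only fractional parts >= stringency or <= 1 - stringency.
    if not (fraction >= stringency or fraction <= 1.0 - stringency):
        return None
    z = round(z_raw)
    return z, z * (mz_low - PROTON_MASS)


def mz_of(mr, z):
    """m/z of an analyte of mass mr carrying z protons (helper for the example)."""
    return mr / z + PROTON_MASS


mr = 16951.5
clean = charge_deconvolution(mz_of(mr, 15), mz_of(mr, 14))
two_intervals = charge_deconvolution(mz_of(mr, 15), mz_of(mr, 13), intervals=2)
# Sloppy end position (1 m/z unit off): rejected at the default stringency...
sloppy = charge_deconvolution(mz_of(mr, 15), mz_of(mr, 14) + 1.0)
# ...but accepted when the stringency is reduced.
relaxed = charge_deconvolution(mz_of(mr, 15), mz_of(mr, 14) + 1.0, stringency=0.8)
```

The sloppy measurement yields a raw charge of about 14.83, which explains why it is rejected at the default stringency of 0.99 and accepted at 0.8.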
In this kind of deconvolution, the user click-drags the cursor
between the first two peaks (when possible) of the isotopic cluster. The
charge state of the ion is the inverse of the m/z delta value that
is the distance between the two consecutive peaks. Figure 2.19, “Isotopic cluster-based mass deconvolution” shows that
deconvolution process at work.
Note
that the click-dragging direction (left -> right or right ->
left) has no impact on the value of the monoisotopic mass computed because
the software postulates that the lightest ion is the peak on the
left.
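A minimal sketch of the isotopic cluster-based calculation follows. The charge is taken as the rounded inverse of the m/z delta and the lightest ion is taken as the peak with the smaller m/z value, as described above. That the ion is protonated is an assumption borrowed from the charge-based deconvolution, and the function name is hypothetical.

```python
PROTON_MASS = 1.007276466  # mass of the proton, in Da


def isotopic_cluster_deconvolution(mz_a, mz_b):
    """Return (z, monoisotopic mass) from two consecutive isotopic peaks."""
    mz_mono = min(mz_a, mz_b)  # the lightest ion is the leftmost peak
    z = round(1.0 / abs(mz_b - mz_a))
    return z, z * (mz_mono - PROTON_MASS)


left_to_right = isotopic_cluster_deconvolution(1000.0, 1000.5)
right_to_left = isotopic_cluster_deconvolution(1000.5, 1000.0)
```

Both calls return the same charge (2) and the same monoisotopic mass, which illustrates why the dragging direction does not matter here.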
When click-dragging the mouse cursor between two mass spectrum
locations of interest, the program computes the apparent resolving
power. This process is shown in Figure 2.20, “Calculation of the resolving power”, where the
resolving power is calculated by dragging the mouse cursor from one edge
of a peak to the other at half maximum height (this is called
full width at half maximum [FWHM] resolution).
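The FWHM resolving power is the ratio of the m/z value of the peak to its width at half maximum height. A minimal sketch, in which the function name is hypothetical and the peak m/z value is assumed to be the middle of the dragged interval:

```python
def resolving_power(mz_left, mz_right):
    """Resolving power from the two peak edges measured at half maximum height."""
    fwhm = abs(mz_right - mz_left)
    mz_center = (mz_left + mz_right) / 2.0  # assumption: apex at mid-interval
    return mz_center / fwhm


r = resolving_power(999.99, 1000.01)  # close to 50000
```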
When the resolution of the mass spectrometer is good, zooming in on a mass peak may reveal that a given ion has given rise not to one peak but to a set of peaks. This set of peaks is called an “isotopic cluster”.
It is possible to predict how a given ion (of a given chemical formula) is supposed to be revealed in a mass spectrum, in the form of such an isotopic cluster. One such cluster is shown in Figure 2.21, “Calculation of the isotopic cluster of an analyte”, for the horse apomyoglobin protein, of elemental composition C769H1213N210O218S2 (this formula is typeset like this intentionally, to show how the formulæ need to be entered in the IsoSpec module).
As of version 5.8.0, mineXpert provides
an interface to the libIsoSpec++
library.
IsoSpec: Hyperfast Fine Structure Calculator
Mateusz K. Łącki, Michał Startek, Dirk Valkenborg, and Anna Gambin
Analytical Chemistry, 2017, 89, 3272–3277
DOI: 10.1021/acs.analchem.6b01459
This library performs high-resolution isotopic cluster calculations. In order to run the calculations, it is necessary to have the following items ready:
An elemental composition formula of the analyte (for example, H2O1). This formula needs to account for the ionization agent that is involved in the ionization of the analyte prior to its detection in the mass spectrometer.
The IsoSpec software requires that all the chemical elements of a chemical formula be indexed. This means that, for water, for example, the formula should be H2O1 (notice the index 1 after the O element symbol).
A detailed isotopic configuration of all the chemical elements that are used in the elemental composition formula. mineXpert provides two interfaces to define the isotopic characteristics of the chemical elements. These will be described in the following sections.
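The indexing requirement on the formula can be checked before submitting it to the calculation. The following sketch uses a regular expression for that purpose; the function name is hypothetical and the check is an illustration, not the parser used by mineXpert or IsoSpec.

```python
import re


def parse_indexed_formula(formula):
    """Return {symbol: count}; fail if any element symbol lacks its index."""
    if not re.fullmatch(r"(?:[A-Z][a-z]?[0-9]+)+", formula):
        raise ValueError(f"every element needs an explicit index: {formula!r}")
    counts = {}
    for symbol, index in re.findall(r"([A-Z][a-z]?)([0-9]+)", formula):
        counts[symbol] = counts.get(symbol, 0) + int(index)
    return counts


water = parse_indexed_formula("H2O1")
apomyoglobin = parse_indexed_formula("C769H1213N210O218S2")
```

With this check, H2O1 is accepted while H2O raises an error because the oxygen symbol has no index.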
Generating isotopic clusters using the IsoSpec software package is not easily carried out, in particular because this remarkable library is designed to be highly performant. The authors rightfully put their energy into optimizations for accuracy and speed instead of investing in a graphical user interface. mineXpert provides that graphical user interface, shown in Figure 2.22, “Isotopic cluster calculation dialog window”, that shows up upon selection of the “” → menu.
The dialog window contains two panels. The left hand side panel configures the charge for which the calculation is to be carried out and the maximum cumulative isotopic presence probability that IsoSpec must reach during the calculation. The right hand side panel contains a tab widget that contains the configuration tabs and the results tab.
An isotopic cluster calculation is most probably performed with the aim of simulating an expected isotopic cluster for an analyte that is being analyzed by mass spectrometry. It is thus logical that the analyte be in an ionized form. The way that the analyte has been ionized needs to be taken into account in the chemical formula that describes the ion for which the isotopic cluster is being calculated. For example, when determining the chemical formula of a protein in the positive ion mode, the number of protons used to ionize the protein needs to be included in the analyte elemental composition formula.
The IsoSpec software is “charge-agnostic” in the sense that it does not know what element in the chemical formula is responsible for the ionization of the analyte. Therefore, IsoSpec does not know of (and does not care about) the charge of the analyte. The ionization level of the analyte can be handled by mineXpert if that information is set to the Ion charge spin box widget. By default, the charge state of the analyte is 1.
The Max. cumulative probability spin box widget serves to configure the extent to which IsoSpec simulates the theoretically expected isotopic cluster. A value of 0.99 tells the software to simulate enough combinations of the analyte isotopes to represent 99 % of the theoretically expected combinations.
For large biopolymers, it might be prudent to start with a relatively low value for Max. cumulative probability, because setting this value too close to 1 would notably increase the calculation duration.
To perform isotopic cluster calculations, the simulation software needs to be aware of all the isotopes of all the chemical elements that enter in the composition of the ionized analyte. An isotope is defined by its mass and by the probability that it is found in nature. Carbon has two major isotopes that can be found in nature: the most abundant 12C isotope and the less abundant 13C isotope.
There are three ways that the user might use the IsoSpec software package. Two of them involve a configuration preparation on the part of the user. The third one, that we'll describe first, does not involve any chemical element configuration. The other ways are reviewed in the next sections.
In order to document all the chemical elements' isotopes' characteristics, the IsoSpec library has in its own headers a number of arrays that mineXpert automatically loads up when opening the dialog window shown in Figure 2.22, “Isotopic cluster calculation dialog window”. These data are displayed in the IsoSpec standard static data tab. The table view widget is not editable, hence the static qualifier in the tab name.
In this tab, all the user has to do is enter the formula for which the isotopic cluster calculation is to be performed. The formula needs to be pasted in the Formula line edit widget.
Once the formula has been set to its line edit widget, configure the charge of the ion (Ion charge spin box widget) and the Max. cumulative probability. Now, click .
While the previous section showed how to use the static element tables from the IsoSpec library, this section shows a way to configure these elemental data. For this, start from the standard IsoSpec-based data and modify them according to one's needs. To proceed, select all the element rows of interest from the IsoSpec standard static data tab and click .
The order in which the rows are selected is respected in the export, so make sure to select the rows in the order that makes sense for you. Also be sure to select rows and not only some cells; for this, click in the left margin of the table view widget.
The export is performed as a text TSV (tab-separated value format) file in a layout that closely mimics the data visible in the table view. Once the file has been saved, open it up in LibreOffice and modify the data to suit your needs. For example, in case of a labelling with 13C, with an efficiency of almost 98 %, let's change the 12C abundance (probability) to 0.02 and the 13C abundance to 0.98.
Now that the modifications have been performed, save the file under the same format and load it in the IsoSpec standard user data tab of the dialog window by clicking Load table from file. The table view now shows the new carbon abundance values. From now on, use this tab exactly like you did for the IsoSpec standard static data work described above.
Here also, it is possible to select determinate rows and then save them to a file for later use.
The last way the user might configure the chemical elements to be used during an isotopic cluster calculation is based on the fully manual description of the elements and of the isotopes. This method is slightly more involved than the previous one but also provides much greater flexibility: it allows one to create “new chemical elements” that might be required in specific labelling experiments. The manual configuration is carried out in the Manual configuration tab of the dialog window, as shown in Figure 2.23, “Hand-made user configuration of the chemical elements and formula”.
Upon creation of the dialog window, the Manual configuration tab is empty, with only two rows of buttons at the bottom of the tab. To start configuring chemical elements, you click to create an “element group box” that contains a number of widgets organized in two rows:
Top row, a line edit widget to receive the chemical element symbol, C in the example;
A spin box widget in which to set the number of such atoms in the formula for which the isotopic cluster is being calculated. In the example, we set this value to 769;
A button with a “minus” image that removes all the “element group box” in one go;
The bottom row contains an “isotope frame widget” with two spin boxes for the mass of the isotope being configured (left) and its corresponding abundance (right);
In addition to the spin boxes, two buttons, with a “plus” or a “minus” figure, allow one to respectively add or remove isotope frames.
It is not possible to remove all the isotope frames from an element group box, otherwise that group box would become useless.
Once an isotope frame has been filled-up, a new line might be required. To create a new isotope frame widget, click any “plus”-labelled button in any of the isotope frames. Once a new frame is created, the spin box widgets that it contains are set to 0.00000. Fill-in these spin boxes with mass and abundance and go on along this path to create as many isotopes as required.
Once all the isotopes for a given chemical element have been defined, a new element might be needed. For this, create a new element group box and start the configuration of the new element as described above.
The manual isotopic configuration of the chemical elements required to perform an isotopic cluster calculation for a given formula is tedious. The user may want to save a given configuration to a file (click Save configuration) so that it is easier to recreate automatically all the widgets upon loading of that saved configuration (click Load configuration).
The final configuration is shown in Figure 2.24, “Typical manual configuration of the isotopic characteristics of the chemical elements”. The experiment that was configured above is a labelling of a glucose molecule with Cz, an imaginary chemical element that is like carbon but that has a 14C isotope. The glucose molecule (normal formula: C6H12O6) is labelled on one single carbon atom with an efficiency of 95 %. This means that, when the labelling fails (in 5 % of the cases), the carbon atom has its isotopes with usual probabilities (compounded by the fact that the normal atom is found at that position only in 5 % of the cases). The isotopic abundances for the Cz element are thus:
For the 12C isotope: 0.05 * normal 12C abundance;
For the 13C isotope: 0.05 * normal 13C abundance;
For the 14C isotope: 0.95;
Note that the “normal” carbon count is 5 (and not 6), that the hydrogen count is 13 (and not 12, because the glucose is protonated) and the labelling carbon is present only once.
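The Cz abundances listed above can be computed as in the following sketch. The natural carbon abundances (0.9893 for 12C and 0.0107 for 13C) are rounded representative values, the isotopes are keyed by mass number rather than by exact mass for brevity, and the function name is hypothetical.

```python
NORMAL_CARBON = {12: 0.9893, 13: 0.0107}  # rounded natural abundances


def labelled_element_abundances(normal, label_isotope, efficiency):
    """Isotopic abundances of an element labelled with the given efficiency."""
    # The unlabelled fraction keeps the natural isotope distribution.
    abundances = {iso: (1.0 - efficiency) * p for iso, p in normal.items()}
    # The labelled fraction is entirely made of the labelling isotope.
    abundances[label_isotope] = abundances.get(label_isotope, 0.0) + efficiency
    return abundances


cz = labelled_element_abundances(NORMAL_CARBON, 14, 0.95)
# Protonated glucose labelled on a single carbon position.
formula_counts = {"C": 5, "Cz": 1, "H": 13, "O": 6}
```

The three Cz abundances sum to 1, as required for a complete description of the element.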
Once the configurations have been completed, the calculations can finally be performed by the IsoSpec library. In the manual configuration setting, the formula is automatically handled since each chemical element that is defined goes along with the count of the corresponding atoms. In the case of the standard IsoSpec configuration (either modified or not), the user has to enter the chemical formula of the analyte in the Formula line edit widget.
Click Run. If the configuration was correct and IsoSpec could run the calculation properly, then the dialog window switches to the IsoSpec results tab (Figure 2.25, “Results from the isotopic cluster calculation”). That tab contains a text edit widget in which the results are displayed.
Note that the m/z values calculated by IsoSpec are “corrected” for the charge level that was specified in the left panel of the dialog window prior to their display in the results tab (Figure 2.22, “Isotopic cluster calculation dialog window”).
The IsoSpec library computes the relative probability of the various combinations of all the isotopes that make the chemical formula submitted to it. The results are in the form of peak centroid values along with corresponding probabilities. The sum of the probabilities corresponds to the Max. cumulative probability value that was set by the user.
The results that are produced by IsoSpec represent the peak centroids of the isotopic cluster. The results are thus a set of (m/z,i) pairs that do not have the characteristic shape (the profile) that is found in mass spectra. mineXpert features the ability to give a shape to the centroid peaks. For that, click To peak shaper to open the Peak Shaper dialog window preloaded with the IsoSpec-generated peak centroids. The workings of this peak shaping feature are described in Section 2.12, “Shaping mass peak centroids into well-shaped peaks”.
The shape of mass peaks is typically Gaussian or Lorentzian (or a mix thereof). There are some data simulation or analysis processes that lead to having mass peaks characterized by a single centroid m/z value and a corresponding intensity. Plotted to a graph, a centroid mass peak yields a bar. In order to convert mass peak centroids into something that resembles a real “profile” mass spectrum, a mathematical formula can be applied (with some parameters) to configure the shapes generated. mineXpert now includes that feature accessible via the menu “” →. The window that is opened is shown in Figure 2.26, “Setting-up of the centroid mass peak shaping process”.
The mass centroid peaks are listed in the Data centroid points (m/z,i) text edit widget. These values are pasted there by the user or copied automatically from the isotopic cluster calculation dialog window (see Section 2.11.1.4, “The IsoSpec results are not shaped mass peaks”). The width of the “profile” mass peak is determined either by setting the resolution of the instrument (in the example that is set to 45000) or by setting the width of the peak at half maximum of its height (FWHM).
It might be useful to have an idea of the FWHM value for a given pair of m/z and resolution values when defining the parameters for the peak shaping process. Double-click-select a single mass peak centroid m/z value from the text edit widget, set the resolving power of your instrument and then click the dedicated button. The computed FWHM value will be displayed next to the button. The text is selectable so as to copy it to the clipboard and then paste it in the FWHM spin box widget.
The spectrum that is generated can be of a Gaussian or a Lorentzian shape. That parameter is configured by selecting the corresponding radio button widget. The number of points used to actually craft the shape of the peak is configurable. In the example, that parameter is set to 150. When the calculation is performed by clicking the Execute button, the mass spectrum that is calculated is displayed as a list of (m/z,i) pairs in the Results tab of the dialog window. In that tab widget, the Display mass spectrum button makes the spectrum available in the Mass-spectrum window (see Figure 2.27, “Spectrum created using the peak shaping feature”).
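The relation between resolution, FWHM and peak shape can be sketched as follows. The FWHM is the centroid m/z value divided by the resolution; a Gaussian of that FWHM has a standard deviation of FWHM / (2 sqrt(2 ln 2)) and a Lorentzian has a half width of FWHM / 2. The function names and the span parameter (the width of the m/z window over which the points are spread) are assumptions made for the illustration; this is not the mineXpert implementation.

```python
import math


def fwhm_from_resolution(mz, resolution):
    """Width at half maximum of a peak at mz for a given resolving power."""
    return mz / resolution


def profile(x, mz, intensity, fwhm, shape):
    """Intensity at x of a peak centered at mz with the given apex intensity."""
    if shape == "gaussian":
        sigma = fwhm / (2.0 * math.sqrt(2.0 * math.log(2.0)))
        return intensity * math.exp(-((x - mz) ** 2) / (2.0 * sigma ** 2))
    gamma = fwhm / 2.0  # Lorentzian half width at half maximum
    return intensity * gamma ** 2 / ((x - mz) ** 2 + gamma ** 2)


def shape_peak(mz, intensity, fwhm, point_count=150, shape="gaussian", span=4.0):
    """Return point_count (m/z, i) pairs covering mz +/- span * fwhm / 2."""
    half_span = span * fwhm / 2.0
    step = 2.0 * half_span / (point_count - 1)
    return [(mz - half_span + k * step,
             profile(mz - half_span + k * step, mz, intensity, fwhm, shape))
            for k in range(point_count)]


fwhm = fwhm_from_resolution(1000.0, 45000)  # about 0.0222
shaped = shape_peak(1000.0, 100.0, fwhm)
```

By construction, both shapes reach half of the apex intensity at a distance of FWHM / 2 from the centroid.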
Note that if the requested resolution is very high, the resulting shaped mass peaks might appear a bit “hairy”. By tweaking the Bin size value, the binning of the spectra might improve the situation. Otherwise, using the contextual menu in the mass spectrum graph to apply a Savitzky-Golay filter, described in Section 2.8.1.3, “Effects of the m/z integration parameters”, will certainly improve things. To achieve such a filtering process, right-click on the mass spectrum trace and choose the corresponding menu item.
If the peak centroids were not “corrected” for their charge in the previous generation step (as in the case of the isotopic cluster calculation), there is still time to apply this “correction” by setting the charge in the Charge spin box widget. If the charge was already accounted for, as described in Section 2.11.1.4, “The IsoSpec results are not shaped mass peaks”, then leave the charge at 1 and the results will be correct.
When doing mass analysis work it is often desirable to store the painstakingly manually picked m/z or Mr values for later use. mineXpert provides a number of solutions to record the data mining work.
The simplest way to record any graph feature is to point that feature with the mouse and press the L key. That key shortcut prints to the console window the coordinates of the current mouse cursor location. To be able to trace back the graph source of that (x,y) pair, the text is printed in the console using the same color as the graph whence the labelling action came. The console is actually a rich text format editor in which it is possible to edit the text contents so as to copy/paste them in the lab-book or an email to a colleague, for example. This is shown in Figure 2.28, “Recording the peak feature coordinates to the console”. The label operation described here does not require any previous integration operation. This is in contrast to the requirements of the mass spectral data analysis recording described below.
The label recording process works without ambiguity when the cursor is located in the single-graph plot widgets. However, when the cursor is located in the multi-graph plot widget (top part of the window displaying TIC chromatograms, mass spectra or drift spectra) then, only the graph(s) currently selected in the Loaded mass spectrum files window is(are) concerned by the label operation.
In order to record the numerous analysis steps that make a data mining session, the corresponding menu item might be called to display the window shown in Figure 2.29, “Setting-up of the recording of the data mining discoveries”. In that window, the user can select the destination of the data analysis recording system: console, clipboard, file or any combination of the three. When selecting file recording, the user might specify if the recording should overwrite any preexisting file or, instead, append to that file. Depending on the kind of graph where data mining occurs, the format of the data to be recorded needs to change. Indeed, it would make no sense to record the charge z when mining data in the Drift spectrum window. This is why the text format of the data export needs to be defined for each one of the three kinds of graphs: TIC/XIC chromatogram, mass spectrum or drift spectrum.
Before delving into the configuration intricacies, let us tell immediately how to trigger the recording of the mining discoveries: using the Space bar.
It is possible to configure the recording system to record to either the console, the clipboard, a file (in append mode or in overwrite mode) or any combination thereof. The format of the string is defined using special characters (see text) and might be defined specifically for the three main graphs: TIC/XIC chromatogram, mass spectrum and drift spectrum.
The format used to define the text string to be stored on console and/or in file can contain particular tokens as described below:
%f : mass spectrometry data file name
%X : value on the X axis of the graph (no unit). For a drift spectrum, that would be drift times in milliseconds, for a mass spectrum, that would be m/z values, for a TIC/XIC chromatogram, that would be retention times in minutes;
%Y : value on the Y axis of the graph (no unit). In all the graph plots, that would be intensities in any unit provided by the mass spectrometer (typically, counts);
%x : delta value on the X axis (when appropriate, no unit)
%y : delta value on the Y axis (when appropriate, no unit)
%I : TIC intensity after TIC integration over a graph range
%s : X axis range start for computation (where applicable, for example for the TIC integration to a single value);
%e : X axis range end for computation (where applicable);
For mass spec plot widgets:
%z : charge
%M : Mr as computed during deconvolution (see Section 2.10, “Mass spectral feature analysis”).
It is important to keep in mind that the %z and %M format strings can only work if the user is actually analyzing a mass spectrum and if the user has effectively performed a deconvolution operation that has allowed computing these two values. If the values are not available, the program shows nan (“not a number”) in the textual output upon hitting the space bar (see below).
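The token substitution described above can be sketched with a regular expression. The function name and the dictionary of values are hypothetical; printing nan for every unavailable value is a generalization of the behavior documented for %z and %M, made for the sake of the example.

```python
import re

TOKENS = "fXYxyIsezM"  # the single-letter tokens listed above


def expand_format(format_string, values):
    """Replace each %token with its value, or with nan when unavailable."""
    def substitute(match):
        value = values.get(match.group(1))
        return "nan" if value is None else str(value)
    return re.sub(f"%([{TOKENS}])", substitute, format_string)


line = expand_format("mz = (%X, %Y) z = %z", {"X": 1051.8, "Y": 50863, "z": 1})
no_deconvolution = expand_format("Mr = %M", {})
```

Any text that is not a token, such as the field names of a stanza, is copied unchanged to the output.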
In the drift spectrum window, the data
recording processes data matching the cursor position at the last
single-click. The program tries to define the intensity by
looking at the graph ordinate (y axis) matching the nearest abscissa
point (x axis) to the last
clicked location.
Also, as stated above for the simple labelling of cursor location points (see Section 2.13.1, “Feature labelling to the console window”), the recording of data analysis steps works both in the multi-graph plot widgets (those at the top of the plot windows) and in the single-graph plot widgets (those at the bottom of the windows). When doing data analysis in the top multi-graph plot widget, it is necessary to select the traces to be analyzed in the Loaded mass spectrum files window, otherwise no data will be recorded. This is of course not necessary when working in bottom plot widgets because, in that case, there is no ambiguity on what data to record.
Once configured, the format strings might be stored in a drop down box for later use. To that end, click onto the Add to history button while having the format text displayed in the text editor and it will be appended to the drop-down list. The list gets stored when the dialog window is closed and will be filled-up again when the program is restarted.
As an example, if the user defined the following format string for a mass spectrum graph:
Mass spec. :
mz = (%X, %Y) z = %z
filename = %f
date = 20161021
session = 20161021
mslevel = 1 msion = esi msanal = tof
chrom = DEAE fraction = 25
seq = pos = oxlevel = 0 pos =
intensity =
comment =
Then, a resulting data mining stanza that would be recorded will look like this:
Mass spec. :
mz = (1051.8, 50863) z = 1
filename = 20161017-rusconi-frac-25-deae-20160712.db
date = 20161021
session = 20161021
mslevel = 1 msion = esi msanal = tof
chrom = DEAE fraction = 25
seq = pos = oxlevel = 0 pos =
intensity =
comment =
Interestingly, the user can define any kind of format, leaving fields available for later filling-in. This feature is of immense value when the analysis file is used later to fill-in a database for easy storage and interrogation of the mining discoveries. In this case, it would be useful to have the file opened in an editor and at each new stanza edit the comment field if something needs to be commented, like the shape/intensity of a mass peak, for example.
Note that the program closes the file each time a new stanza has been written. This makes it possible to edit that file safely in between each stanza record. Remember to force the editor to reload the file from disk after each mining discovery recording.
When the recording involves sending the analysis data to the console, the data are sent to it as text colored the same as the spectrum that was under scrutiny.
When the mouse cursor has been placed at the proper location on the
graph (with or without click-dragging, depending on the
situation), the user hits the space bar and the data analysis stanza is
recorded to the selected destination(s): console, clipboard,
file.
At the moment, this feature is only available for mass data files in the SQLite3 format, as obtained via the mineXpert conversion feature described at Section 2.15, “Converting mzML to SQLite3”.
Mass spectrometry data acquisitions performed in line with a chromatography setup generally yield massive data files holding mass data acquired all along a chromatographic development. Depending on the application, the data in the resulting file might not be of interest over the whole acquisition duration. For example, a size exclusion chromatography might have resolved the molecular species of interest only in a very short retention time range. In this case, it might be useful to extract the data corresponding to that retention time range from the initial very large file and store them in another file that is light enough to be loaded and analyzed quickly and easily.
The data file slicing feature is called by clicking the plot widget of interest, either from the TIC chromatogram plot widget corresponding to the file to be sliced or from the Drift spectrum plot widget of interest.
mineXpert can export data according to various modes, as illustrated in Figure 2.30, “Setting-up of the data export”.
The easiest operation is to first zoom in on the relevant data in the TIC chromatogram plot widget and leave the default Export the whole displayed region in a single file checkbox checked. This process basically prunes all the data outside of the currently zoomed-in region. It is also possible to export a TIC chromatogram range different from the currently displayed one by entering the Range start and Range end values in the respective spin boxes. For this, uncheck the Export the whole displayed region in a single file checkbox and set the Slice count value to 1.
Another way to operate the slicer is to uncheck the Export the whole displayed region in a single file checkbox and define either the number of slices to be generated or their size. Depending on the “slicing” configuration, the program will calculate the missing configuration bits to perform the required action. If the user specifies the number of slices, the size of the slices is automatically calculated. Conversely, if the user specifies the size of the slices, the number of slices is deduced.
Note that the Range start and Range end values, corresponding to the limits of the currently zoomed-in data range, are always honored when performing the slice number or slice size computations mentioned above.
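The relation between the range, the slice count and the slice size can be sketched as follows. This is a hedged Python illustration of the arithmetic only, not the program's actual implementation: the function name plan_slices is hypothetical, and the handling of a range that is not an exact multiple of the slice size (a shorter last slice) is an assumption.

```python
import math


def plan_slices(range_start, range_end, slice_count=None, slice_size=None):
    """Return (start, end) pairs covering [range_start, range_end].

    Exactly one of slice_count or slice_size must be given; the other
    one is deduced from the range width.
    """
    if range_end <= range_start:
        raise ValueError("range_end must be greater than range_start")
    if (slice_count is None) == (slice_size is None):
        raise ValueError("give either slice_count or slice_size, not both")

    span = range_end - range_start
    if slice_count is None:
        # Size given: deduce the number of slices (assumption: round up,
        # so that the last slice may be shorter than the others).
        slice_count = math.ceil(span / slice_size)
    else:
        # Count given: deduce the size of each slice.
        slice_size = span / slice_count

    slices = []
    for index in range(slice_count):
        start = range_start + index * slice_size
        # The last slice always ends exactly at range_end.
        end = range_end if index == slice_count - 1 else start + slice_size
        slices.append((start, end))
    return slices
```

For instance, a retention time range of 10 to 20 minutes with a slice count of 4 gives four slices of 2.5 minutes each, while the same range with a slice size of 3 minutes gives four slices, the last one covering only 19 to 20 minutes.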
The file naming pattern used for the various data files generated in the process of the data export is governed by the format string displayed at the top of the window. By default, the generated files are located in the same directory as the source data file. If a new directory is to be used, it can be selected by pressing the Directory... push button.
Once the configuration is done, press the Validate push button to preview the file names that have been chosen for the data slices to be written. If the configuration is correct, click the Confirm and start data export button.
Note that the data export feature can only be used with the following two data ranges:
Retention time ranges: the data export configuration window must be activated from one of the TIC chromatogram plot widgets;
Ion mobility drift time ranges: the data export must be triggered from one of the Drift spectrum plot widgets.
This is because it does not make sense to split up a given data file into smaller files on the basis of m/z data ranges.
The mzML format is very verbose and parsing it causes a notable delay during loading of mass spectrometry files. mineXpert allows one to convert mzML files to a private open file format based on the SQLite3 database software.
Using the SQLite3 format also makes it possible to slice very large data files into smaller files on the basis of user-selected criteria (see Section 2.14, “Splitting files into smaller chunks”).
For this feature, mineXpert must be run in a system console window. To display detailed help, type the following:
minexpert --help
Use the following parameters (or flags) to perform a data file conversion:
minexpert -x -o <db file name> <mzML file name>
For example, to convert file test-file.mzml into test-file.db, the command line would be:
minexpert -x -o test-file.db test-file.mzml
Alternatively, the -o flag can specify a directory, in which case the new file name is crafted from the mzML file and written into that directory. In that case, the extension of the mzML file needs to be either mzML or mzml for the automatic renaming to occur.
Note that batch conversion is possible using this kind of command line:
minexpert -x -o /tmp /home/<user>/lab/mzml/*.mzml
In this case, automatic file renaming happens and the new db files are all stored in the /tmp directory.
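To predict which file names a batch conversion will produce, the automatic renaming rule can be mimicked with a short Python sketch. It is based on assumptions drawn from the examples above: the mzML or mzml extension is replaced with db and the file is placed in the output directory. The function name db_path_for is hypothetical, and what the program actually does with other extensions is not documented here, so the sketch simply rejects them.

```python
from pathlib import Path


def db_path_for(mzml_file, output_dir):
    """Return the db file path expected for one mzML file.

    Assumption: only the mzML and mzml extensions trigger the
    automatic renaming, as stated in the text above.
    """
    source = Path(mzml_file)
    if source.suffix not in (".mzML", ".mzml"):
        raise ValueError(f"no automatic renaming for {source.name}")
    # Keep the base name, swap the extension, move to the output directory.
    return Path(output_dir) / (source.stem + ".db")


print(db_path_for("lab/mzml/test-file.mzml", "/tmp"))
```

Running the function over every file matched by a wildcard gives the list of db files to expect in the output directory.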