Ingest the RxRx1 dataset¶

!lamin init --storage test-rxrx --schema bionty,wetlab

import lamindb as ln
import bionty as bt
import wetlab as wl

ln.track("Zo0qJt4IQPsb0000")

The metadata.csv was originally downloaded from here and deposited on S3.

Load metadata¶

Read in the raw metadata of the wells:

meta = ln.Artifact(
    "s3://lamindata/rxrx1/metadata.csv",
    description=(
        "Experimental design of RxRx1, e.g. what cell type and"
        " treatment are in each well."
    ),
).load()
meta.head()

	site_id	well_id	cell_type	dataset	experiment	plate	well	site	well_type	sirna	sirna_id
0	HEPG2-08_1_B02_1	HEPG2-08_1_B02	HEPG2	test	HEPG2-08	1	B02	1	negative_control	EMPTY	1138
1	HEPG2-08_1_B02_2	HEPG2-08_1_B02	HEPG2	test	HEPG2-08	1	B02	2	negative_control	EMPTY	1138
2	HEPG2-08_1_B03_1	HEPG2-08_1_B03	HEPG2	test	HEPG2-08	1	B03	1	treatment	s21721	855
3	HEPG2-08_1_B03_2	HEPG2-08_1_B03	HEPG2	test	HEPG2-08	1	B03	2	treatment	s21721	855
4	HEPG2-08_1_B04_1	HEPG2-08_1_B04	HEPG2	test	HEPG2-08	1	B04	1	treatment	s20894	710

The column storing cell lines is erroneously called cell_type, and dataset refers to what is typically called a split. Let’s rename both:

meta.rename({"cell_type": "cell_line", "dataset": "split"}, axis=1, inplace=True)

Add a paths column. Each entry is an aggregate over the 6 paths for the 6 channels of a site, which we’ll expand into individual paths further down:

paths = []
for _, row in meta.iterrows():
    well = row.well
    site = row.site
    paths.append(
        f"images/{row.split}/{row.experiment}/Plate{row.plate}/{well}_s{site}_w1-w6.png"
    )
meta["paths"] = paths
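To make the path pattern concrete, here is a minimal sketch that builds the aggregate path for a single site in plain Python. The helper name `aggregate_path` is hypothetical and not part of the notebook; the values are taken from the first metadata row shown above.

```python
def aggregate_path(split: str, experiment: str, plate: int, well: str, site: int) -> str:
    """Build the aggregate image path covering channels w1 to w6 of one site."""
    return f"images/{split}/{experiment}/Plate{plate}/{well}_s{site}_w1-w6.png"


# Hypothetical helper applied to the first metadata row (same pattern as the loop above)
example = aggregate_path("test", "HEPG2-08", 1, "B02", 1)
print(example)
```

Note that the plate is still a plain integer at this point, so the `Plate` prefix is added inside the f-string.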

Use more meaningful plate names:

meta["plate"] = meta["plate"].apply(lambda name: f"Plate{name}")

Create a DataFrame with each row as a single image, similar to a link table but with multiple metadata columns:

meta_with_path = meta.copy()
keys_list = []
for key in meta_with_path["paths"]:
    keys = [key.replace("w1-w6.png", f"w{channel}.png") for channel in range(1, 7)]
    keys_list.append(keys)
meta_with_path["path"] = keys_list
meta_with_path = meta_with_path.explode("path").reset_index(drop=True)
del meta_with_path["paths"]
meta_with_path

	site_id	well_id	cell_line	split	experiment	plate	well	site	well_type	sirna	sirna_id	path
0	HEPG2-08_1_B02_1	HEPG2-08_1_B02	HEPG2	test	HEPG2-08	Plate1	B02	1	negative_control	EMPTY	1138	images/test/HEPG2-08/Plate1/B02_s1_w1.png
1	HEPG2-08_1_B02_1	HEPG2-08_1_B02	HEPG2	test	HEPG2-08	Plate1	B02	1	negative_control	EMPTY	1138	images/test/HEPG2-08/Plate1/B02_s1_w2.png
2	HEPG2-08_1_B02_1	HEPG2-08_1_B02	HEPG2	test	HEPG2-08	Plate1	B02	1	negative_control	EMPTY	1138	images/test/HEPG2-08/Plate1/B02_s1_w3.png
3	HEPG2-08_1_B02_1	HEPG2-08_1_B02	HEPG2	test	HEPG2-08	Plate1	B02	1	negative_control	EMPTY	1138	images/test/HEPG2-08/Plate1/B02_s1_w4.png
4	HEPG2-08_1_B02_1	HEPG2-08_1_B02	HEPG2	test	HEPG2-08	Plate1	B02	1	negative_control	EMPTY	1138	images/test/HEPG2-08/Plate1/B02_s1_w5.png
...	...	...	...	...	...	...	...	...	...	...	...	...
753055	U2OS-03_4_O23_2	U2OS-03_4_O23	U2OS	train	U2OS-03	Plate4	O23	2	treatment	s21454	509	images/train/U2OS-03/Plate4/O23_s2_w2.png
753056	U2OS-03_4_O23_2	U2OS-03_4_O23	U2OS	train	U2OS-03	Plate4	O23	2	treatment	s21454	509	images/train/U2OS-03/Plate4/O23_s2_w3.png
753057	U2OS-03_4_O23_2	U2OS-03_4_O23	U2OS	train	U2OS-03	Plate4	O23	2	treatment	s21454	509	images/train/U2OS-03/Plate4/O23_s2_w4.png
753058	U2OS-03_4_O23_2	U2OS-03_4_O23	U2OS	train	U2OS-03	Plate4	O23	2	treatment	s21454	509	images/train/U2OS-03/Plate4/O23_s2_w5.png
753059	U2OS-03_4_O23_2	U2OS-03_4_O23	U2OS	train	U2OS-03	Plate4	O23	2	treatment	s21454	509	images/train/U2OS-03/Plate4/O23_s2_w6.png

753060 rows × 12 columns
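As a quick sanity check on the expansion, the sketch below expands one aggregate path into its six channel paths and confirms that the 753060 rows shown above divide evenly by 6. The function name `expand_channels` is an assumption made for illustration, and the sketch presumes every site has exactly 6 channel images.

```python
def expand_channels(aggregate: str, n_channels: int = 6) -> list[str]:
    """Turn one aggregate path ending in 'w1-w6.png' into one path per channel."""
    return [
        aggregate.replace("w1-w6.png", f"w{channel}.png")
        for channel in range(1, n_channels + 1)
    ]


channel_paths = expand_channels("images/test/HEPG2-08/Plate1/B02_s1_w1-w6.png")
for path in channel_paths:
    print(path)

# Row count reported for the exploded table, divided by the number of channels
n_site_rows, remainder = divmod(753060, 6)
print(n_site_rows, remainder)
```

A remainder of zero is what we’d expect if every site row was expanded into exactly 6 image rows.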

Validate and register metadata¶

rxrx_curator = ln.Curator.from_df(
    meta_with_path,
    categoricals={
        "cell_line": bt.CellLine.name,
        "split": ln.ULabel.name,
        "experiment": wl.Experiment.name,
        "plate": ln.ULabel.name,
        "well": wl.Well.name,
        "well_type": ln.ULabel.name,
        "sirna": wl.GeneticPerturbation.name,
    },
)

rxrx_curator.validate()

• saving validated records of 'cell_line'

✓ added 4 records from public with CellLine.name for "cell_line": 'U-2 OS cell', 'Hep G2 cell', 'hTERT RPE-1 cell', 'HUV-EC-C cell'

• mapping "cell_line" on CellLine.name

!   4 terms are not validated: 'HEPG2', 'HUVEC', 'RPE', 'U2OS'
    4 synonyms found: "HEPG2" → "Hep G2 cell", "HUVEC" → "HUV-EC-C cell", "RPE" → "hTERT RPE-1 cell", "U2OS" → "U-2 OS cell"
    → curate synonyms via .standardize("cell_line")

• mapping "split" on ULabel.name

!   2 terms are not validated: 'test', 'train'
    → fix typos, remove non-existent values, or save terms via .add_new_from("split")

• mapping "experiment" on Experiment.name

!   51 terms are not validated: 'HEPG2-08', 'HEPG2-09', 'HEPG2-10', 'HEPG2-11', 'HUVEC-17', 'HUVEC-18', 'HUVEC-19', 'HUVEC-20', 'HUVEC-21', 'HUVEC-22', 'HUVEC-23', 'HUVEC-24', 'RPE-08', 'RPE-09', 'RPE-10', 'RPE-11', 'U2OS-04', 'U2OS-05', 'HEPG2-01', 'HEPG2-02', ...
    → fix typos, remove non-existent values, or save terms via .add_new_from("experiment")

• mapping "plate" on ULabel.name

!   4 terms are not validated: 'Plate1', 'Plate2', 'Plate3', 'Plate4'
    → fix typos, remove non-existent values, or save terms via .add_new_from("plate")

• mapping "well" on Well.name

!   308 terms are not validated: 'B02', 'B03', 'B04', 'B05', 'B06', 'B07', 'B08', 'B09', 'B10', 'B11', 'B12', 'B13', 'B14', 'B15', 'B16', 'B17', 'B18', 'B19', 'B20', 'B21', ...
    → fix typos, remove non-existent values, or save terms via .add_new_from("well")

• mapping "well_type" on ULabel.name

!   3 terms are not validated: 'negative_control', 'treatment', 'positive_control'
    → fix typos, remove non-existent values, or save terms via .add_new_from("well_type")

• mapping "sirna" on GeneticPerturbation.name

!   1139 terms are not validated: 'EMPTY', 's21721', 's20894', 's19827', 's19792', 's19935', 's21398', 's223097', 's348', 's19975', 's19911', 's21543', 's195030', 's20290', 's20345', 's20305', 's20110', 's21048', 's20519', 's21045', ...
    → fix typos, remove non-existent values, or save terms via .add_new_from("sirna")

False

rxrx_curator.standardize("cell_line")
rxrx_curator.add_new_from("split")
rxrx_curator.add_new_from("experiment")
rxrx_curator.add_new_from("plate")
rxrx_curator.add_new_from("well_type")
# well requires row and column information so we'll create records manually
# sirna requires system information so we'll create records manually
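Conceptually, standardizing replaces each synonym with its validated name. The sketch below mimics this with a plain dictionary built from the synonym mapping reported in the validation output above. It is an illustration only and not how lamindb implements `.standardize()`; the names `SYNONYMS` and `standardize_values` are hypothetical.

```python
# Synonym mapping copied from the validation output for "cell_line"
SYNONYMS = {
    "HEPG2": "Hep G2 cell",
    "HUVEC": "HUV-EC-C cell",
    "RPE": "hTERT RPE-1 cell",
    "U2OS": "U-2 OS cell",
}


def standardize_values(values: list[str], synonyms: dict[str, str]) -> list[str]:
    """Replace known synonyms by their validated names; leave other values untouched."""
    return [synonyms.get(value, value) for value in values]


standardized = standardize_values(["HEPG2", "U2OS", "Hep G2 cell"], SYNONYMS)
print(standardized)
```

Values that are already validated names pass through unchanged, so applying the mapping twice gives the same result.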

`well`¶

We also want to add the well information to link image files and parse images based on well coordinates. We first extract well locations from the table:

# Temporarily disable synonym search to reduce standard output
ln.settings.creation.search_names = False
wells = [
    wl.Well(name=well, row=well[0], column=int(well[1:]))
    for well in meta["well"].unique()
]
ln.save(wells)
ln.settings.creation.search_names = True
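The comprehension above splits each well name into a row letter and a column number. Here is a minimal standalone sketch of that parsing, assuming well names consist of one uppercase letter followed by digits (as in B02 or O23). The helper `parse_well` and the pattern are hypothetical and only serve to make the convention explicit.

```python
import re

# Assumption: a well name is one uppercase row letter followed by a numeric column
WELL_PATTERN = re.compile(r"([A-Z])(\d+)")


def parse_well(name: str) -> tuple[str, int]:
    """Split a well name such as 'B02' into its row letter and integer column."""
    match = WELL_PATTERN.fullmatch(name)
    if match is None:
        raise ValueError(f"Unexpected well name: {name!r}")
    return match.group(1), int(match.group(2))


print(parse_well("B02"))
print(parse_well("O23"))
```

Unlike plain slicing, the regular expression raises an error on names that don’t fit the expected layout, which can help catch malformed entries before records are created.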

`sirna`¶

Add the sirna values to the GeneticPerturbation table:

# Temporarily disable synonym search to reduce standard output
ln.settings.creation.search_names = False
sirnas = [
    wl.GeneticPerturbation(
        name=sirna,
        system="siRNA",
    )
    for sirna in meta["sirna"].unique()
]
ln.save(sirnas)
ln.settings.creation.search_names = True

`cell_line`¶

Add commonly used abbreviations:

bt.CellLine.get("30n7ByjL").set_abbr("HUVEC")
bt.CellLine.get("6EK4GXdy").set_abbr("U2OS")
bt.CellLine.get("og6IaxOV").set_abbr("RPE")
bt.CellLine.get("4ea731nb").set_abbr("HEPG2")

Register metadata file¶

meta_af = rxrx_curator.save_artifact(
    key="rxrx1/metadata.parquet",
    description="Metadata with file paths for each RxRx1 image.",
)

# Add a `readout` label using the `Experimental Factor Ontology`:
readout_feat = ln.Feature(name="readout", dtype="cat").save()
readout = bt.ExperimentalFactor.from_source(name="high content screen").save()
meta_af.labels.add(readout, readout_feat)

meta_af.describe()

Artifact .parquet/DataFrame
├── General
│   ├── .uid = 'ObUn5B5Ytjp8Ucx90000'
│   ├── .key = 'rxrx1/metadata.parquet'
│   ├── .size = 5720995
│   ├── .hash = '8KaHcVpmukO1EhY2qL7IGA'
│   ├── .path = /home/runner/work/lamin-spatial/lamin-spatial/docs/notebooks/test-rxrx/.lamindb/ObUn5B5Ytjp8Ucx90000.parquet
│   ├── .created_by = testuser1 (Test User1)
│   ├── .created_at = 2024-12-20 15:37:09
│   └── .transform = 'Ingest the RxRx1 dataset'
├── Dataset features/.feature_sets
│   └── columns • 7                 [Feature]                                                           
│       cell_line                   cat[bionty.CellLine]       HUV-EC-C cell, Hep G2 cell, U-2 OS cell,…
│       experiment                  cat[wetlab.Experiment]     HEPG2-01, HEPG2-02, HEPG2-03, HEPG2-04, …
│       plate                       cat[ULabel]                Plate1, Plate2, Plate3, Plate4           
│       sirna                       cat[wetlab.GeneticPertur…  EMPTY, n337250, s1174, s12279, s134, s13…
│       split                       cat[ULabel]                test, train                              
│       well                        cat[wetlab.Well]           B02, B03, B04, B05, B06, B07, B08, B09, …
│       well_type                   cat[ULabel]                negative_control, positive_control, trea…
├── Linked features
│   └── readout                     cat[bionty.ExperimentalF…  high content screen                      
└── Labels
    └── .experiments                wetlab.Experiment          HEPG2-01, HEPG2-02, HEPG2-03, HEPG2-04, …
        .wells                      wetlab.Well                B02, B03, B04, B05, B06, B07, B08, B09, …
        .genetic_perturbations      wetlab.GeneticPerturbati…  EMPTY, n337250, s1174, s12279, s134, s13…
        .cell_lines                 bionty.CellLine            U-2 OS cell, Hep G2 cell, hTERT RPE-1 ce…
        .experimental_factors       bionty.ExperimentalFactor  high content screen                      
        .ulabels                    ULabel                     train, test, Plate3, Plate2, Plate4, Pla…

Register images¶

ln.UPath("gs://rxrx1-europe-west4/images").view_tree(level=2)

Take a subset to run on CI:

images = ln.Artifact(
    "gs://rxrx1-europe-west4/images/test/HEPG2-08", description="RxRx1 image files"
)
images.n_objects

images.save()

collection = ln.Collection(
    images, name="Annotated RxRx1 images", meta_artifact=meta_af, version="1"
)
collection.save()

collection.meta_artifact

collection.data_artifact

collection.describe()
