Writing

Delve · KAUST · 2018–2019

What I learned building Delve

Reflections on building a dataset-driven scholarly search engine: the interface problems that shaped it, and what it taught me about research infrastructure.

scholarly search · datasets · research infrastructure

Delve started from a frustration I kept hearing from researchers: "I know the dataset I want exists, but I cannot find it."

Academic search is document-centric. You search for papers, not for the artefacts inside them. But a researcher looking for a benchmark, a corpus, or a specific data collection is not looking for the paper that introduced it — they are looking for the thing itself, described accurately enough to know it is what they need.

The insight that shaped Delve is that datasets have a natural citation graph of their own. Papers that use a dataset cite the paper that introduced it. Papers that compare methods on the same benchmark cluster together. If you model that structure, you can surface relationships that keyword search misses entirely.
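To make that structure concrete, here is a minimal sketch in Python. It is an illustration under simplifying assumptions, not Delve's actual code: the function names, the toy paper identifiers, and the choice of Jaccard overlap as the relatedness score are all mine. It treats every paper that cites a dataset's introducing paper as a user of that dataset, then ranks other datasets by how many users they share.

```python
from collections import defaultdict


def dataset_users(citations, introduced_by):
    """Map each dataset to the set of papers that cite its introducing paper.

    citations: dict of paper id -> set of cited paper ids (hypothetical input)
    introduced_by: dict of dataset name -> id of the paper that introduced it
    """
    intro_to_dataset = {paper: ds for ds, paper in introduced_by.items()}
    users = defaultdict(set)
    for paper, cited in citations.items():
        for ref in cited:
            if ref in intro_to_dataset:
                users[intro_to_dataset[ref]].add(paper)
    return dict(users)


def related_datasets(users, dataset):
    """Rank other datasets by Jaccard overlap of their citing papers."""
    base = users.get(dataset, set())
    scores = []
    for other, papers in users.items():
        if other == dataset:
            continue
        union = base | papers
        if not union:
            continue
        overlap = len(base & papers) / len(union)
        if overlap > 0:
            scores.append((other, overlap))
    # Highest overlap first; ties broken alphabetically for stable output.
    return sorted(scores, key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    # Toy data: p1 and p2 evaluate on both A and B, p3 only on A, p4 only on C.
    citations = {
        "p1": {"intro_A", "intro_B"},
        "p2": {"intro_A", "intro_B"},
        "p3": {"intro_A"},
        "p4": {"intro_C"},
    }
    introduced_by = {"A": "intro_A", "B": "intro_B", "C": "intro_C"}
    users = dataset_users(citations, introduced_by)
    print(related_datasets(users, "A"))
```

In this toy example, dataset B surfaces as related to A because two of the three papers using A also use B, even though nothing in their names links them. That is the kind of relationship a keyword query would not find.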

The engineering challenge was extraction quality. Dataset mentions in papers are inconsistent: sometimes a formal name, sometimes a description, sometimes just an implicit reference. We combined NLP extraction with the citation graph to resolve ambiguity. If two papers describe what sounds like the same dataset and share a citation ancestor that introduced it, they are probably referring to the same thing.
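The resolution heuristic can be sketched roughly as follows. This is a simplified stand-in rather than the system we built: word-level Jaccard overlap substitutes for the NLP extraction step, and the function names, the `threshold` parameter, and the paper identifiers are hypothetical. Two mentions count as the same dataset only when their wording is similar and both papers cite a common paper known to introduce a dataset.

```python
def token_overlap(a, b):
    """Word-level Jaccard similarity; a crude stand-in for NLP matching."""
    tokens_a = set(a.lower().split())
    tokens_b = set(b.lower().split())
    if not tokens_a or not tokens_b:
        return 0.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def likely_same_dataset(mention_a, mention_b, citations,
                        introducing_papers, threshold=0.3):
    """Return True if two mentions probably refer to the same dataset.

    mention_a, mention_b: (paper id, mention text) pairs
    citations: dict of paper id -> set of cited paper ids
    introducing_papers: set of paper ids known to introduce a dataset
    threshold: minimum text similarity (an arbitrary illustrative value)
    """
    paper_a, text_a = mention_a
    paper_b, text_b = mention_b
    # Citation evidence: both papers cite the same dataset-introducing paper.
    shared_ancestors = (
        citations.get(paper_a, set())
        & citations.get(paper_b, set())
        & introducing_papers
    )
    # Text evidence: the two descriptions sound alike.
    similar_text = token_overlap(text_a, text_b) >= threshold
    return bool(shared_ancestors) and similar_text


if __name__ == "__main__":
    citations = {
        "p1": {"intro_x", "survey"},
        "p2": {"intro_x"},
        "p3": {"survey"},
    }
    introducing = {"intro_x"}
    print(likely_same_dataset(("p1", "the ImageNet benchmark"),
                              ("p2", "ImageNet"), citations, introducing))
    print(likely_same_dataset(("p1", "ImageNet"),
                              ("p3", "ImageNet"), citations, introducing))
```

Requiring both signals is the point of the design. Text similarity alone merges datasets that merely have similar names, and a shared citation alone merges every dataset mentioned by papers in the same subfield.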

The work that came out of this, published in ACM SIGKDD Explorations and presented at ECML PKDD, validated the approach. But what I remember most is how much the interface problem drove the system design. Once you commit to helping researchers find datasets, every technical choice becomes a question about what researchers actually need to know: provenance, usage context, size, format, licence. Not just a ranked list of papers.