Researching non-Latin script searches in an English-language Library search

May 06, 2024

BY Meg Mcmahon

THE QUESTION

In Fall 2023, the URC undertook two studies: one to explore Library staff's perceptions of HOLLIS when searching with non-Latin scripts and the other to examine the search strategies employed by non-staff users when looking for non-Latin script materials in HOLLIS.

Katie Kim and I spoke with experts in Cyrillic, Japanese, Hebrew, Arabic, Chinese, and Korean scripts. We chose these scripts because Google Analytics data show they are the most searched non-Latin script languages in HOLLIS.

THE METHODOLOGY

We interviewed two groups: 10 library staff who work with non-Latin scripts and 9 researchers who search for non-Latin materials. We recruited the researchers by asking the library staff whom they would suggest for an interview, and each researcher received a $25 Amazon gift card for their time.

The interview questions for staff focused on understanding how they search for non-Latin materials and what strategies they use within HOLLIS to find known items or topics within non-Latin script languages. We also asked how they work with researchers to help them find items in non-Latin languages and ways they think we could improve cataloging and discovery of these materials.

We conducted a contextual inquiry with the researchers to understand how they searched for materials related to their research. Follow-up questions covered what is and is not working for them in HOLLIS and the workarounds they use to find their materials.

THE RESULTS

The basics of cataloging and language came up throughout the research, and understanding them helps explain why searching HOLLIS for non-Latin materials can be difficult. It is important to state that this is not a fault of any language; it has to do with how the language fits (or doesn’t) within the current romanization rules. Romanization is the conversion of text from a different writing system to the Roman (Latin) script, or a system for doing so.

Language resists romanization due to dialects, colloquial pronunciation, or tonal pronunciation. This affects the following languages: Chinese, Arabic, Persian, Coptic, Armenian, Hebrew, and Yiddish. For example, Yiddish and Hebrew have multiple ways of correctly spelling the same word, with vowel indicators and without. This can create confusion for people searching in these languages.
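
The Hebrew case can be made concrete with a small sketch. It assumes the only difference between two spellings is the presence of vowel points, which Unicode encodes as combining marks; it does not cover spellings that add or drop whole letters. The function name `strip_vowel_marks` is hypothetical and says nothing about how HOLLIS actually matches text.

```python
import unicodedata


def strip_vowel_marks(text: str) -> str:
    """Drop combining marks (Unicode category Mn), which include Hebrew vowel points."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


# The same Hebrew word ("shalom"), written with escapes to avoid display issues.
pointed = "\u05e9\u05b8\u05c1\u05dc\u05d5\u05b9\u05dd"  # with vowel points
unpointed = "\u05e9\u05dc\u05d5\u05dd"                  # without vowel points

# A naive comparison treats the two spellings as different strings.
print(pointed == unpointed)

# After stripping the marks, both spellings compare equal.
print(strip_vowel_marks(pointed) == strip_vowel_marks(unpointed))
```

A search that compares raw strings sees two different words here, which is one way two correct spellings can return different results.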

Languages created using two different language rules lead to inconsistent standardization and, ultimately, inconsistent romanization. For example, the Coptic language uses Arabic characters but a combination of English and Arabic language rules. This leads to unique spellings of words that most Arabic scripts, besides Coptic languages, don’t use. These different kinds of spellings are hard to romanize and search for. This affects the following languages: Yiddish, Coptic, and Persian.

Language and cataloging practices change over time, creating differences in romanization. A romanization table is a system for converting one type of script to another, and there were many conversations about how these tables have changed. This primarily affects older materials in HOLLIS today. For example, it was noted that anything cataloged before computers is often not at current standards, which is no one’s fault; it is the nature of time moving forward.
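
To illustrate what a change in romanization table means for searching, here is a minimal sketch. `TABLE_A` and `TABLE_B` are toy, hypothetical tables covering five Cyrillic letters; they are not copies of any official standard, and `romanize` is an illustrative helper, not a HOLLIS or Library of Congress function.

```python
# Toy tables: the same Cyrillic letters mapped under two different conventions.
TABLE_A = {
    "\u0427": "Ch",   # Cyrillic capital Che
    "\u0435": "e",    # Cyrillic small Ie
    "\u0445": "kh",   # Cyrillic small Ha
    "\u043e": "o",    # Cyrillic small O
    "\u0432": "v",    # Cyrillic small Ve
}
TABLE_B = {
    "\u0427": "\u010c",  # Latin capital C with caron
    "\u0435": "e",
    "\u0445": "ch",
    "\u043e": "o",
    "\u0432": "v",
}


def romanize(text: str, table: dict) -> str:
    """Replace each character using the table; leave unknown characters as-is."""
    return "".join(table.get(ch, ch) for ch in text)


word = "\u0427\u0435\u0445\u043e\u0432"  # a surname in Cyrillic script

print(romanize(word, TABLE_A))  # Chekhov
print(romanize(word, TABLE_B))  # Čechov
```

A record cataloged under one convention will not match a query typed under the other, even though both are valid romanizations of the same name.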

Search habits

Participants expect to have to revise their search terms and use other search methods when searching non-Latin scripts because slight differences in approach lead to vastly different search results. The insights below support this:

When asked to search for a non-Latin script item, 9/13 participants started with the script of their language; for example, if someone was searching for Hebrew materials, they used Hebrew script.
Participants filled in language knowledge gaps with Google or WorldCat.
Languages (Arabic/Persian, Chinese/Japanese/Korean) that share characters or words can confuse researchers when searching.
Participants will continually revise their search query romanization if they do not find materials they believe should be in our collections. Often, they never look at the Library of Congress's romanization; they do this based on their understanding of romanization, even if they are aware of LC as a standard.
Metadata chasing from one record works with varying degrees of success. There are two main issues: subject headings at times do not match a native speaker's understanding of the item's subject, and sometimes no other items share those subject headings because they were not chosen for other records.
Participants use filters to narrow down information; for example, the publisher filter helps if the keywords are vague and could be used in many countries.

THE CONCLUSION

Throughout the process, I gained a deeper understanding of and appreciation for how our metadata specialists and librarians who support non-Latin language research work with non-Latin languages. They are doing great work with a highly layered challenge of romanizing non-Latin script and helping researchers find materials.

In conclusion, I want to present this project's recommendations:

Continue to catalog using the parallel fields. Many participants notice and appreciate parallel records.
What if we added a bullet point to the “no results found” screen directly related to languages, prompting users to contact a subject librarian?
What if we linked records of different text translations from the original language if they exist?
What if we considered AI-powered solutions for subject headings of non-Latin materials or fixing backlog materials?
What if we included information on how to reach out to a non-Latin expert in our classroom work?
What if we collaborated with librarians in other countries to create other scripts?
What if we allowed the selection of general topics within subject headings in HOLLIS, not the whole string?