Corpus & Computational Linguistics program suggestions


I am new to this forum, and looking for recommendations.

I completed six years of fieldwork on Shangan Makhuwa, a dialect of Makhuwa spoken on the Northeast coast of Mozambique, in 2010. Since then, I have experienced many personal obstacles and interruptions, but am now free to return to my work.

In addition to producing a dictionary and grammar of the Shangan dialect, I have also produced over 380 hours of transcribed data, with the aid of six field assistants. I believe the best way to analyze all of this will be via the fields of corpus and computational linguistics. However, I have never worked in these fields before, so I thought I would ask if anyone here could recommend a program or programs that I could set up to process data on a little-studied language. It will be an evolving methodology, looking into such things as collocations, topic modeling, and semantic fields, so I would need something fairly open and flexible. I will also continue adding to my dictionary and grammar as I work through the transcripts. All of the transcripts were derived from high-quality WAV files, so I could also go back and reanalyze them for, say, greater phonetic detail or time alignment, if necessary.

So far, I have come across this book:

“Natural Language Processing and Computational Linguistics: A practical guide to text analysis with Python, Gensim, spaCy and Keras.” By Bhargav Srinivasa-Desikan.

As well as the program AntConc.

If anyone has any suggestions along these lines, or can recommend someone with experience in using these kinds of programs, that’d be greatly appreciated! It would be nice to be confident of my options before committing much more time to them.

Thank you for your time and attention!


Hi, welcome to the forum. I’m not an NLP person myself, although there are several here.

One resource you might be interested in is NLTK, which has been around for a while and has a nice free book to go along with it.

I believe the library (it’s in Python) is still actively maintained, and if I recall correctly it has resources on things like collocations.
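To give a rough sense of what a collocation search does under the hood, here is a minimal sketch in plain Python. It is not NLTK's own code; it just scores adjacent word pairs by pointwise mutual information (PMI), one common association measure. The function name `bigram_pmi`, the `min_count` cutoff, and the toy English sentence are all made up for illustration, and real transcripts would need proper tokenization first.

```python
import math
from collections import Counter


def bigram_pmi(tokens, min_count=2):
    """Score adjacent word pairs by pointwise mutual information (PMI).

    Hypothetical helper for illustration only; not part of NLTK.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_tokens = len(tokens)
    n_bigrams = max(len(tokens) - 1, 1)

    scores = {}
    for (w1, w2), count in bigrams.items():
        # Rare pairs get inflated PMI scores, so skip them.
        if count < min_count:
            continue
        p_pair = count / n_bigrams
        p_w1 = unigrams[w1] / n_tokens
        p_w2 = unigrams[w2] / n_tokens
        scores[(w1, w2)] = math.log2(p_pair / (p_w1 * p_w2))

    # Highest-scoring (most strongly associated) pairs first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy English tokens, assumed purely for the example.
tokens = "the old chief ruled the coast and the old chief ruled the hills".split()

for pair, score in bigram_pmi(tokens):
    print(pair, round(score, 2))
```

Pairs of words that occur together more often than their individual frequencies would predict come out on top. As far as I know, NLTK wraps the same idea in `BigramCollocationFinder` and `BigramAssocMeasures`, so you would not need to write this yourself.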

Congrats on all the work you and your colleagues have done, sounds amazing. Feel free to share more about the research here if you like.


Thanks Pat, I appreciate your input and encouragement!

After a few more days looking into options, I’m thinking a programming newbie like me might be best off using a program that’s already set up with a user interface, like AntConc. I’ve also just heard of Sketch Engine. The latter appears to rely on online use and storage. As an older guy, I don’t know if I like the idea of doing my work online instead of on my desktop. Does anyone have an opinion about these or other programs? I came across this summary: The Best Free Discourse Analysis Tools - Speak Ai, which doesn’t really give enough insight to make a decision. Looks like I’d have to try out each one a bit to get a sense of which one makes the most intuitive sense to me.

If I go this route and start with one or more of the user-friendly options, then once I have learned the basics and tried some of the common NLP features on my data, I could dedicate more time to learning to code if I find I need a feature or command with greater refinement, such as those offered by NLTK. In any case, it seems the consensus is that Python is the easiest language to base a more tech-savvy approach on, right?

Over the weekend I worked my way through about half of the book I mentioned above, just to get a sense of what goes into these tools. The author uses Python, Gensim, spaCy and Keras. Do you think it would be advantageous to familiarize myself with a more complete package like NLTK right from the start?

As a new user on this forum, I’m not allowed to upload documents, but if anyone is interested, I could email them a dissertation prospectus from 13 years ago(!), just in case it’s interesting and/or helps people on this forum get a better sense of what sort of tools/programs would be of most use to me.

Here’s a snippet of transcribed data as well:

Ph’aamantari ukhuma mmwaani’mmo’mmo Mwiinanenu waHuNnansure, … (4s) vano Hu…HuNnansure t’aamantari mpaka uYookola, mmakhuwani ţo mmyaakoni phw’aamantari’nnye

É que mandava a partir dessa região limite no régulo Nansure, …(4s) então rég…régulo Nansure é que mandava até Yocola, primeiro no interior nas montanhas ele é que estava a mandar.

Thanks for your time!

