Analysis of word frequency count for a 10 TB file with words from the Oxford Dictionary [on hold]



I have been asked to write my analysis of which programming language and platform I would choose for counting word frequencies in a file of words from the Oxford dictionary, one word per line, with a total file size of 10 terabytes. The file contains every word from the dictionary at least once.



How should I approach this problem?



I was thinking of starting with how many words the Oxford dictionary has and what the average word length is, then calculating the number of bytes each word takes up on its line.








This is probably far too open-ended right now to be suitable as a question here. Have you been asked to write about a concrete program already written, or about how you would write a program yourself? If the latter, are there any approaches you might have taken if the file weren't 10 TB, that you can no longer take knowing how big the file is?
– hvd
Jun 30 at 6:46





@hvd I have reframed the question. Can you please check again?
– Rohan
Jun 30 at 6:54





You choose a language which supports reading files, dictionaries, and string processing, and one you know. The last point is the most important, since any serious programming language today fulfils the other points, even VBA.
– Doc Brown
Jun 30 at 7:01





.. and "platform" - the given kind of problem can be solved on any major desktop or server system (ok, doing this on a typical mobile phone platform mightnot be feasible today). The limiting speed factor is probably disk I/O, so storage system and adapter is what you should be worried about.
– Doc Brown
Jun 30 at 7:10





1 Answer



Here are some questions to consider:



How many distinct words are there in the OED? Presumably you are not being asked to group related word forms, so "run", "runs", "runner", "running" and "ran" will all count as distinct words.



Hence how big is your table of word counts going to be in megabytes? What is the overhead of storing lots of small strings in various programming languages? If you use a library or built-in dictionary type (as in Python), how much overhead does that add? Could you design a more efficient data structure? What data structures have been invented in the past for doing this? (Hint: fitting a spell checker onto a single floppy disk used to be an important problem.)
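
As a starting point, here is a rough Python sketch of how you might measure that dictionary overhead empirically. The 200,000-word vocabulary and 5-character average length are assumptions taken from the discussion below, not measured OED figures, and the generated words are synthetic.

    # Rough sketch: measure the memory cost of a Python dict of word counts.
    # Vocabulary size and word length are assumptions, not real OED data.
    import random
    import string
    import tracemalloc

    NUM_WORDS = 200_000          # assumed vocabulary size
    AVG_LEN = 5                  # assumed average word length

    tracemalloc.start()
    counts = {}
    for i in range(NUM_WORDS):
        # Generate a unique pseudo-word; real words would vary more in length.
        word = ''.join(random.choices(string.ascii_lowercase, k=AVG_LEN)) + str(i)
        counts[word] = 0

    current, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    print(f"dict with {len(counts):,} keys uses roughly {current / 1e6:.1f} MB")

Comparing that figure against the raw size of the strings themselves gives you a feel for how much of the memory is pure bookkeeping overhead.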



How long will it take to process each word? Are you familiar with "big-O" notation, e.g. O(n) and O(log n)? If not, it is worth reading up on it first.
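
To make that concrete, the minimal sketch below contrasts the per-word lookup cost of a hash table (roughly O(1) on average) with a plain list scan (O(n)). The word list and iteration counts are arbitrary assumptions for illustration.

    # Minimal sketch: dict lookup vs. list scan for per-word processing cost.
    import time

    words = [f"word{i}" for i in range(200_000)]
    as_list = list(words)
    as_dict = {w: 0 for w in words}

    probe = "word199999"   # near the end: worst case for the list scan

    start = time.perf_counter()
    for _ in range(100):
        _ = probe in as_list      # O(n): scans the whole list each time
    list_time = time.perf_counter() - start

    start = time.perf_counter()
    for _ in range(100):
        _ = probe in as_dict      # O(1) average: a single hash lookup
    dict_time = time.perf_counter() - start

    print(f"list: {list_time:.3f}s  dict: {dict_time:.6f}s")

Multiplied by the trillions of words in a 10 TB file, that difference decides whether the program finishes at all.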



As others have noted, merely reading 10 TB from a disk drive takes a non-trivial amount of time. You should consider how to balance the speed of the program against the speed of the storage.
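
A back-of-envelope calculation makes the point; the throughput figures below are illustrative assumptions, not benchmarks of any particular hardware.

    # Back-of-envelope estimate of how long just reading 10 TB takes.
    FILE_SIZE = 10 * 10**12                     # 10 TB in bytes

    for label, bytes_per_sec in [("spinning disk (~150 MB/s)", 150e6),
                                 ("SATA SSD (~500 MB/s)", 500e6),
                                 ("NVMe SSD (~3 GB/s)", 3e9)]:
        hours = FILE_SIZE / bytes_per_sec / 3600
        print(f"{label}: about {hours:.1f} hours of pure sequential reading")

On most storage the I/O time dwarfs the CPU time spent counting, so a program that is "fast enough" to keep the disk busy is all you need.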



What if all the words are "a"? In that case you might have to count to around 5 trillion (10^13 bytes at two bytes per line), but a 32-bit unsigned int only holds numbers up to about 4 billion. Make sure you allow for 64-bit integers. You should get into the habit of thinking about this kind of question: I've seen a number of disasters of various sizes caused by arithmetic overflows. E.g. here and here.
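
The arithmetic is quick to check. Python's own integers are arbitrary precision, so the snippet below only demonstrates the bound; the overflow risk is real in languages with fixed-width integer types.

    # Worst case: a 10 TB file where every line is "a\n", i.e. 2 bytes per occurrence.
    FILE_SIZE = 10 * 10**12
    worst_case_count = FILE_SIZE // 2          # about 5 * 10**12 occurrences

    UINT32_MAX = 2**32 - 1                     # about 4.3 billion
    UINT64_MAX = 2**64 - 1

    print(worst_case_count > UINT32_MAX)       # True: a 32-bit counter overflows
    print(worst_case_count > UINT64_MAX)       # False: a 64-bit counter is plenty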





From what I calculated: 10 TB = 10^13 bytes, number of words in the OED = 200,000, average word length = 5 characters (5 bytes). Approximate number of words that can fit in a 10 TB file = 10^13 / (5 + 1), with 1 extra byte for the newline. Can you tell me if I am on the right track?
– Rohan
Jun 30 at 11:07





@Rohan. That sounds about right. However, you haven't been given any information about what these words are, so the file might contain a lot of very long words. The 5 characters per word is about right for ordinary English text, but only because common words are short and get reused constantly. In your counted index each short word occurs only once, the same as each long word, so when calculating the memory for your index you are going to have a different (higher) average word length.
– Paul Johnson
Jun 30 at 11:26






In the question I have mentioned that every word from the OED will appear at least once. I thought of creating a Map with 200,000 keys and a Long value to keep each count. I am trying to calculate how much memory this Map will use with such a large number of keys, and whether it will be efficient.
– Rohan
Jun 30 at 11:48






@Rohan. Ahh, I missed that point. Sorry. One way to find out how big it will be would be just to create one in Python and measure it (i.e. find out how much memory you are using before and after).
– Paul Johnson
Jun 30 at 11:50






@Rohan, that's for you to research. I've given you some pointers; now go do the work.
– Paul Johnson
Jun 30 at 11:52
