Category: research

edX + Microsoft Course on R

I have taken some courses on Coursera, on very interesting subjects. For the first time I decided to enroll in a course on edX, and I ended up in a Microsoft DataCamp course on R. The course is very basic, focused more on syntax than on statistics. Good enough. The course is based on five- to six-minute videos, followed by some exercises done online. Although too basic, the course was interesting enough for me, as I knew nothing about R.

But what really annoyed me was the final evaluation. First, the instructions claim we will have 4 minutes per question… but the timer starts at 3 minutes. Well… Microsoft… Then, instead of questions on using R for some interesting task (take this data, transform it, then plot it), the exam is a set of puzzle questions, with placeholders we need to fill in to obtain some result. Sure, because when using R we will be faced with puzzles. And worse, there is no pause button. I understand you might want to limit the time to answer each question, but why not allow pausing between questions? Especially when you are at home and someone rings your doorbell. Well, after 20 minutes or so at the door, I returned to the exam, and still got 75%.

And, OK, I did not pay for a verified certificate. But you could at least provide a digital certificate we could print and show. OK, I'll get back to Coursera. See you, edX!

Digital Object Identifier

If you are an academic, you know what DOIs are, and you know that a lot of websites now require your publications to have one, so you can refer to them (for example, Publons). DOIs are managed by the IDF, the International DOI Foundation, a not-for-profit organization.

On the other hand, I am a co-editor of Linguamática, a free and open access journal. It has existed for eight years and never got any funding. Editors and reviewers are not paid. Publication is free, and so is access to the contents.

For some time now I have wanted to add DOIs to Linguamática. But all the DOI registration agencies I consulted offer only paid memberships, plus fees for each DOI created. This is not possible for Linguamática, unless we charge fees to the authors or to the readers. And for us it is better to stay without DOIs than to change our policies.

Yesterday I tried to contact the IDF directly, asking if there was any way to get DOIs:

Dear Sirs,

I am one of the editors of Linguamática (http://linguamatica.com). This journal is in its eighth year of existence, without any fee for authors or readers. We do everything for the advancement of the natural language processing area, for free.

As you know, a lot of services require DOI identifiers. Unfortunately we do not have any means to pay for the DOI services from one of the registration agencies. Does DOI/IDF have any service for this kind of initiative?

Thank you

Unfortunately this was the answer I got:

Hello,

Unfortunately, no. In order to acquire a DOI you must work with an existing RA, and it is up to each RA to establish a business plan and set pricing. Crossref offers low pricing for your type of journal, but I doubt they assign DOIs at no cost at all.

And, as you might guess, RAs (registration agencies) are not not-for-profit organizations. As an example, Crossref has an annual fee of 275 dollars, plus an extra dollar for each deposited article (Linguamática publishes 10 to 20 articles per year, at most). So, we would need around 300 dollars each year to have DOIs. Unfortunately, that is not possible.

I wonder if other open access journals have similar problems, and how they solved them. Or whether someone else thinks an Open-DOI is needed, supported by volunteers and by minimal monetary contributions for server and domain expenses…

Open-Access Journals

It seems that Harvard fears Open-Access journals, and decided to run a bogus research project to show that Open-Access journals accept bad/wrong papers. They submitted a paper to a number of Open-Access journals and found that a high percentage accepted it.

What are people claiming now? That Open-Access journals do not do peer review, or do it poorly, and that they are more interested in publishing anything than in assessing the quality of the documents being published.

The fact is that this study was flawed from the beginning. What does a percentage say by itself? If I said that 80% of women cheat on their husbands, would that mean they are worse than men? Perhaps the percentage for men is even higher, but if it was not computed, there is no way to compare.

So, what does it mean that a lot of Open-Access journals accept unacceptable papers? Just that. It does not mean regular journals are better, or that conferences are better. You might know that there is a whole industry behind conferences (I have organized a bunch of them; the highest fee so far was 120 euro, and I offered lunches and a dinner… others I attend cost more than 500 euro and offer nothing at all!). The same happens with regular journals, and with Open-Access journals, of course.

But please, do valid research. Take the article and submit it to the same number of Open-Access journals, standard journals, and conferences. Then compare the results. That is research! Computing a single percentage means nothing.

Please do not smear Open-Access journals. They are the way to go for public research!

Map-Reduce, or why I hate software patents.

Lately you have probably been hearing a lot about map-reduce. I first heard the term at last year's Codebits. Although I wasn't there, there was a talk with that title. I confess that, knowing map and reduce are common functional operators in different programming languages, I did not look at the talk's abstract. During this year's YAPC::Europe (Yet Another Perl Conference), in Pisa, I saw a book on Hadoop, asked a friend who wanted to buy it what it was about, and he said: a framework to implement Map-Reduce.
 
That made me think… wait… this must be the name of something different from what I thought it was. Looking deeper, I understood the concept. Googling, I found that Google filed the patent request in 2004, and was granted the patent in 2010. I also found that I had used that construct in 2007, and documented it in my PhD thesis in 2008. Of course, I did not call it Map-Reduce. In fact, I did not call it anything fancy. It was just a way to get results. I named it my “divide and conquer approach”. And I had not heard of Google's approach either. I just arrived at it because I needed some results.
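
The concept itself is small enough to sketch in a few lines. Here is a toy word count in Python (my own illustration, not Google's or Hadoop's code); the value of the patented approach is, of course, in distributing the map and reduce phases over many machines, which this sketch does not do:

```python
from collections import defaultdict
from functools import reduce

documents = [
    "the quick brown fox",
    "the lazy dog",
    "the quick dog",
]

# Map phase: each document independently emits (word, 1) pairs.
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle phase: group the emitted pairs by key (the word).
groups = defaultdict(list)
for word, one in mapped:
    groups[word].append(one)

# Reduce phase: combine the values of each group into a final count.
counts = {word: reduce(lambda a, b: a + b, ones)
          for word, ones in groups.items()}

print(counts)  # {'the': 3, 'quick': 2, 'brown': 1, 'fox': 1, 'lazy': 1, 'dog': 2}
```

Any divide and conquer computation with this shape (independent per-item work, then aggregation by key) fits the model.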
 
So, this is yet another reason why I hate software patents.

TEI – Well Done!

I will not detail anything about TEI, the Text Encoding Initiative. Sorry. I would just like to let you know that every time I need to work with any TEI subset, I find myself amazed by the quality of their documentation and the details they thought about before writing the standard.

Sometimes I catch myself thinking… do I really need all this stuff? The common answer is no, I do not need so much detail in my annotations.

But that doesn't mean I should not use TEI. I should probably look at the section about the items I am trying to annotate and meditate. I will probably not need the full range of tags and details defined by TEI, but I am almost sure I will find one or two I had not thought about. Then, I can use the portion of TEI I really want and forget about the rest. My document will probably not validate against TEI, but it will not be too far off. And if someone else looks at the document, she will probably understand it. If she doesn't, I can always point to the TEI documentation and say: I am not using it all, just the subset I thought relevant.

Where am I using TEI? You can see it being used in the Dicionário-Aberto project, where the dictionary is encoded in a TEI subset. I am also looking at the TEI header and filtering it, to make it an option for annotating documents in a parallel corpora project.
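
To give an idea of what working with a subset looks like, here is a hypothetical, minimal dictionary entry of my own, loosely based on the TEI dictionaries module (the elements are real TEI, but the selection and the content are mine, not the actual Dicionário-Aberto markup), and how little code is needed to process it:

```python
import xml.etree.ElementTree as ET

# Hypothetical minimal entry using a small TEI subset: entry, form,
# orth, gramGrp, pos, sense and def are all TEI elements, but this
# particular selection and content are an illustration only.
entry_xml = """
<entry xml:id="gato">
  <form type="lemma"><orth>gato</orth></form>
  <gramGrp><pos>noun</pos></gramGrp>
  <sense n="1"><def>A small domesticated feline.</def></sense>
</entry>
"""

entry = ET.fromstring(entry_xml)
print(entry.findtext("form/orth"))  # gato
print(entry.findtext("sense/def"))  # A small domesticated feline.
```

That is the spirit of using a subset: the tags keep their TEI meaning, anyone who knows TEI will recognize them, and I simply ignore the rest of the standard.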

DBLP Bibliography Database and Scientific Publications in Portugal

In Portugal, universities rate researchers according to whether or not their publications appear in Internet article databases like DBLP or ISI Web of Knowledge. Basically, if your article is not indexed anywhere, then it is class C. If it is indexed in DBLP, it is class B. Finally, if it is present in ISI Web of Knowledge, it is class A.

That is, if you can persuade the DBLP author to publish the information about a conference or a journal, you can get your article rated B. Then, if a commercial company (that is, ISI Web of Knowledge) includes your article, you get a class A article.
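
Written down, the whole evaluation scheme amounts to something like this (my caricature in Python, not any official algorithm; the function name and parameters are invented):

```python
def rate_article(in_dblp: bool, in_isi: bool) -> str:
    """Caricature of the rating rules described above."""
    if in_isi:    # present in ISI Web of Knowledge
        return "A"
    if in_dblp:   # listed in DBLP
        return "B"
    return "C"    # not indexed anywhere

print(rate_article(in_dblp=True, in_isi=False))  # B
```

Note what is missing: nothing about the article itself is ever looked at.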

I wonder how a single person (Michael Ley is doing a great job; that is not the problem) can determine whether a journal is good or not across all areas. I do not know what Michael's own research area is, but I do not believe he can discern which conferences or journals are good for Parallel Computation, Natural Language Processing, Bioinformatics, Artificial Intelligence, and so on.

Also, I wonder why there is a journal with a single issue published in DBLP, and even that one without all its articles listed. Yes, there is a journal with more than thirty issues. Only one is in DBLP. And that one is not complete: just half of its articles are listed.

Yes, I tried a couple of times (in fact, more than four times) to send the full information about that journal, and offered to write the BibTeX entries for all the journal issues myself. I never got an answer.

The same happened when I sent (twice) the index of a journal on Natural Language Processing for the Iberian languages. No answer at all. Is it because it is a bad journal? Maybe. But I do not think my mails were read at all.

I can make similar comments about ISI Web of Knowledge. Why is a company maintaining this index? Why is this index paid for? If a journal or conference pays for its inclusion, do you think the company will reply that it does not have enough quality to be listed?

More questions can be asked. Check the number of conferences or journals on computer architecture. Then check the number of conferences or journals in Natural Language Processing. Then check how many conferences or journals are indexed in each of these areas. Yes, it is easier to be a GOOD researcher in computer architecture than in Natural Language Processing. Go figure why…